ProxyPrompt: Securing System Prompts against Prompt Extraction Attacks

Hui-Po Wang; Maria-Irina Nicolae; Mario Fritz; Zhixiong Zhuang

arxiv: 2505.11459 · v2 · submitted 2025-05-16 · 💻 cs.CR

Zhixiong Zhuang, Maria-Irina Nicolae, Hui-Po Wang, Mario Fritz

Pith reviewed 2026-05-22 14:02 UTC · model grok-4.3

classification 💻 cs.CR

keywords system prompts · prompt extraction attacks · LLM security · defense mechanism · prompt obfuscation · adversarial robustness · utility preservation

The pith

ProxyPrompt swaps sensitive system prompts for proxies that preserve task performance but block attackers from recovering originals or details.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ProxyPrompt to defend against prompt extraction attacks on large language models. It replaces the original system prompt with a proxy version that performs the same task without exposing sensitive information or filtering criteria. Tests across 264 LLM and prompt pairs show this approach protects 94.70 percent of prompts from successful extraction, compared to 42.80 percent for the next best existing method. The method targets the risk that well-crafted prompts encode domain expertise and private rules that attackers could misuse. If the defense works as described, applications could safely deploy complex prompts without relying on defenses that must be constantly revised.

Core claim

ProxyPrompt prevents prompt leakage by replacing the original prompt with a proxy that maintains the original task's utility while obfuscating the extracted prompt, ensuring attackers cannot reproduce the task or access sensitive information. Comprehensive evaluations on 264 LLM and system prompt pairs show that ProxyPrompt protects 94.70% of prompts from extraction attacks, outperforming the next-best defense, which only achieves 42.80%.
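As a quick sanity check on the headline figures, the percentages can be converted back into whole numbers of prompt pairs. This is a minimal sketch, assuming the protection rate is simply protected pairs divided by the 264 evaluated pairs and rounded to two decimals; the function names are illustrative and the paper's exact counting rule may differ.

```python
# Assumption: rate = protected_pairs / total_pairs, reported to two decimals.
# The helper names below are hypothetical and not taken from the paper.
TOTAL_PAIRS = 264


def implied_count(rate_percent: float, total: int = TOTAL_PAIRS) -> int:
    """Nearest whole number of pairs that matches a reported percentage."""
    return round(rate_percent / 100 * total)


def rate_from_count(count: int, total: int = TOTAL_PAIRS) -> float:
    """Percentage (two decimals) implied by a whole number of pairs."""
    return round(100 * count / total, 2)


proxy_count = implied_count(94.70)     # ProxyPrompt
baseline_count = implied_count(42.80)  # next-best defense

print(proxy_count, rate_from_count(proxy_count))
print(baseline_count, rate_from_count(baseline_count))
```

Under this assumption the reported rates correspond to 250 and 113 of the 264 pairs, and both counts round back to the published percentages.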

What carries the argument

The proxy prompt, a constructed substitute for the original system prompt that delivers equivalent task utility while rendering any extracted version unable to reproduce the original task or expose sensitive details.

If this is right

Attackers extracting the proxy cannot reproduce the original task or access encoded sensitive information.
Complex expert-crafted system prompts can be deployed in applications without direct leakage risks.
Defenses against extraction no longer require constant updates to match evolving attack techniques.
Sensitive filtering criteria and domain rules stay protected even if prompts are targeted.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Proxy generation could be automated to scale the defense across many different tasks and models.
The approach might layer with other security techniques such as output filtering for stronger protection.
It could apply to agent-based or multi-turn LLM interactions where prompts evolve over time.

Load-bearing premise

A proxy prompt can be constructed that delivers equivalent task performance to the original while ensuring that any extracted output fails to let attackers reproduce the task or recover sensitive information.

What would settle it

Demonstrating an extraction attack that recovers the original prompt or enough details from the proxy to fully reproduce the task and sensitive criteria would disprove the defense.

Original abstract

The integration of large language models (LLMs) into a wide range of applications has highlighted the critical role of well-crafted system prompts, which require extensive testing and domain expertise. These prompts enhance task performance but may also encode sensitive information and filtering criteria, posing security risks if exposed. Recent research shows that system prompts are vulnerable to extraction attacks, while existing defenses are either easily bypassed or require constant updates to address new threats. In this work, we introduce ProxyPrompt, a novel defense mechanism that prevents prompt leakage by replacing the original prompt with a proxy. This proxy maintains the original task's utility while obfuscating the extracted prompt, ensuring attackers cannot reproduce the task or access sensitive information. Comprehensive evaluations on 264 LLM and system prompt pairs show that ProxyPrompt protects 94.70% of prompts from extraction attacks, outperforming the next-best defense, which only achieves 42.80%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces ProxyPrompt, a defense that replaces original system prompts with proxy prompts designed to preserve task utility while obfuscating the prompt to prevent extraction attacks from allowing attackers to reproduce the task or access sensitive information. Evaluations across 264 LLM and system prompt pairs report that ProxyPrompt protects 94.70% of prompts, outperforming the next-best defense at 42.80%.

Significance. If the central empirical claims hold under closer scrutiny of the evaluation protocol, ProxyPrompt could offer a practical, update-light defense for securing sensitive system prompts in deployed LLM applications. The scale of the 264-pair evaluation is a strength that provides broader coverage than typical prompt-security studies.

major comments (2)

[Abstract and Evaluation section] The 94.70% protection figure is presented as evidence that attackers 'cannot reproduce the task,' yet it is unclear whether extraction success means recovering the exact original prompt string or enabling an attacker to reproduce the original task behavior or sensitive criteria via follow-up queries to the proxy. This distinction is load-bearing for the security claim because functional equivalence is required for utility but is in direct tension with preventing reproduction.
[Evaluation section] The manuscript lacks sufficient detail on attack implementations, baseline constructions, and statistical controls (e.g., variance across runs or significance testing), which prevents verification of the reported performance gap over the 42.80% baseline.

minor comments (2)

[Method section] Clarify the exact procedure for constructing the proxy prompt and any hyperparameters involved, as these appear central to reproducibility.
[Results] Ensure tables reporting the 264-pair results include per-category breakdowns or confidence intervals to strengthen the aggregate 94.70% claim.
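To illustrate the kind of confidence interval the last comment asks for, the sketch below computes a Wilson score interval for a single proportion. It assumes the 94.70% figure corresponds to 250 protected pairs out of 264 and treats the pairs as independent trials; neither assumption is confirmed by the text above, and the paper may report uncertainty differently.

```python
import math


def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total <= 0 or not 0 <= successes <= total:
        raise ValueError("need 0 <= successes <= total and total > 0")
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


# Assumed count: 250 of 264 pairs protected (inferred from 94.70%, not stated).
low, high = wilson_interval(250, 264)
print(f"protection rate interval: {low:.4f} to {high:.4f}")
```

The Wilson interval is used here instead of the simpler normal approximation because it behaves better when the proportion is close to 1, as it is for the ProxyPrompt figure.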

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify key aspects of our evaluation protocol. We address each major comment below and have revised the manuscript to incorporate the requested clarifications and details.

Referee: [Abstract and Evaluation section] The 94.70% protection figure is presented as evidence that attackers 'cannot reproduce the task,' yet it is unclear whether extraction success means recovering the exact original prompt string or enabling an attacker to reproduce the original task behavior or sensitive criteria via follow-up queries to the proxy. This distinction is load-bearing for the security claim because functional equivalence is required for utility but is in direct tension with preventing reproduction.

Authors: We appreciate the referee pointing out this critical distinction in our threat model. In ProxyPrompt, extraction success is defined as the attacker being able to reproduce the original task behavior or access sensitive criteria (e.g., filtering rules) via follow-up queries when using the extracted proxy prompt. This is distinct from exact string recovery of the original prompt. The proxy is designed to preserve utility for legitimate use while obfuscating the prompt such that extracted versions do not enable functional reproduction of the task or sensitive behaviors by an adversary. We have revised the Abstract and Evaluation sections to explicitly state this definition of extraction success and how the 94.70% protection rate was measured under this criterion, resolving the noted tension.
revision: yes

Referee: [Evaluation section] The manuscript lacks sufficient detail on attack implementations, baseline constructions, and statistical controls (e.g., variance across runs or significance testing), which prevents verification of the reported performance gap over the 42.80% baseline.

Authors: We agree that greater detail is required for reproducibility and to substantiate the performance claims. In the revised manuscript, we have expanded the Evaluation section with: (i) full specifications of the attack implementations, including query templates and prompting strategies employed; (ii) precise descriptions of how each baseline defense was constructed and parameterized; and (iii) statistical controls, including standard deviations across repeated runs and results of significance testing (paired t-tests) confirming the gap versus the 42.80% baseline is statistically significant. These additions enable independent verification of the results.
revision: yes
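For readers unfamiliar with the paired t-test mentioned in the response, the following sketch shows how the statistic is formed from matched per-run scores. The numbers are placeholder values chosen only to make the example run; they are not results from the paper, and the function name is hypothetical.

```python
import math
import statistics


def paired_t_statistic(a: list[float], b: list[float]) -> float:
    """t statistic for paired samples: mean(d) / (stdev(d) / sqrt(n)), d = a - b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))


# Placeholder per-run protection rates (NOT from the paper).
defense_runs = [0.95, 0.93, 0.96, 0.94]
baseline_runs = [0.44, 0.41, 0.45, 0.42]

t = paired_t_statistic(defense_runs, baseline_runs)
print(f"t = {t:.2f} with {len(defense_runs) - 1} degrees of freedom")
```

The resulting statistic would be compared against a t distribution with n - 1 degrees of freedom to obtain a p-value; that lookup is omitted here to keep the sketch dependency-free.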

Circularity Check

0 steps flagged

No significant circularity detected

The paper introduces ProxyPrompt as an empirical defense that replaces system prompts with proxies and validates the approach through direct experiments on 264 LLM-prompt pairs, reporting a 94.70% protection rate. No equations, parameter fits presented as predictions, self-referential definitions, or load-bearing self-citation chains appear in the provided text. The central security claim rests on reported experimental outcomes against extraction attacks rather than any derivation that reduces to its own inputs by construction, making the work self-contained against the stated benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The defense rests on the domain assumption that task utility can be preserved under prompt substitution without exposing original content; no free parameters or invented physical entities are introduced.

axioms (1)

domain assumption: LLM behavior on a task remains functionally equivalent when the system prompt is replaced by a suitably designed proxy.
Required for the utility-maintenance claim in the abstract.

invented entities (1)

Proxy prompt · no independent evidence
purpose: Obfuscate sensitive system prompt content while preserving task performance.
New construct introduced to achieve the defense goal.

pith-pipeline@v0.9.0 · 5685 in / 1186 out tokens · 37874 ms · 2026-05-22T14:02:50.947755+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

$\arg\min_{\tilde{\phi}_P} \left[ \frac{1}{|Q|} \sum L\big(f_{\phi_P}(\phi_Q),\, f_{\tilde{\phi}_P}(\phi_Q)\big) + L\big(f_{\tilde{\phi}_P \,\|\, \phi_{P'}}(\phi_{Q'}),\, \tilde{P}\big) \right]$ (Eq. 3); continuous-to-discrete gap quantified by cosine similarity to nearest vocabulary tokens

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear

Semantic-Match (SM) and Most-Similar (MS) metrics using entailment and sentence embeddings to detect rephrased leaks

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.