ProxyPrompt: Securing System Prompts against Prompt Extraction Attacks
Pith reviewed 2026-05-22 14:02 UTC · model grok-4.3
The pith
ProxyPrompt swaps sensitive system prompts for proxies that preserve task performance but block attackers from recovering originals or details.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ProxyPrompt prevents prompt leakage by replacing the original prompt with a proxy that maintains the original task's utility while obfuscating the extracted prompt, ensuring attackers cannot reproduce the task or access sensitive information. Comprehensive evaluations on 264 LLM and system prompt pairs show that ProxyPrompt protects 94.70% of prompts from extraction attacks, outperforming the next-best defense, which only achieves 42.80%.
What carries the argument
The proxy prompt, a constructed substitute for the original system prompt that delivers equivalent task utility while rendering any extracted version unable to reproduce the original task or expose sensitive details.
If this is right
- Attackers extracting the proxy cannot reproduce the original task or access encoded sensitive information.
- Complex expert-crafted system prompts can be deployed in applications without direct leakage risks.
- Defenses against extraction no longer require constant updates to match evolving attack techniques.
- Sensitive filtering criteria and domain rules stay protected even if prompts are targeted.
Where Pith is reading between the lines
- Proxy generation could be automated to scale the defense across many different tasks and models.
- The approach might layer with other security techniques such as output filtering for stronger protection.
- It could apply to agent-based or multi-turn LLM interactions where prompts evolve over time.
Load-bearing premise
A proxy prompt can be constructed that delivers equivalent task performance to the original while ensuring that any extracted output fails to let attackers reproduce the task or recover sensitive information.
What would settle it
Demonstrating an extraction attack that recovers the original prompt or enough details from the proxy to fully reproduce the task and sensitive criteria would disprove the defense.
Original abstract
The integration of large language models (LLMs) into a wide range of applications has highlighted the critical role of well-crafted system prompts, which require extensive testing and domain expertise. These prompts enhance task performance but may also encode sensitive information and filtering criteria, posing security risks if exposed. Recent research shows that system prompts are vulnerable to extraction attacks, while existing defenses are either easily bypassed or require constant updates to address new threats. In this work, we introduce ProxyPrompt, a novel defense mechanism that prevents prompt leakage by replacing the original prompt with a proxy. This proxy maintains the original task's utility while obfuscating the extracted prompt, ensuring attackers cannot reproduce the task or access sensitive information. Comprehensive evaluations on 264 LLM and system prompt pairs show that ProxyPrompt protects 94.70% of prompts from extraction attacks, outperforming the next-best defense, which only achieves 42.80%.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ProxyPrompt, a defense that replaces original system prompts with proxy prompts designed to preserve task utility while obfuscating the prompt to prevent extraction attacks from allowing attackers to reproduce the task or access sensitive information. Evaluations across 264 LLM and system prompt pairs report that ProxyPrompt protects 94.70% of prompts, outperforming the next-best defense at 42.80%.
Significance. If the central empirical claims hold under closer scrutiny of the evaluation protocol, ProxyPrompt could offer a practical, update-light defense for securing sensitive system prompts in deployed LLM applications. The scale of the 264-pair evaluation is a strength that provides broader coverage than typical prompt-security studies.
major comments (2)
- [Abstract and Evaluation section] The 94.70% protection figure is presented as evidence that attackers 'cannot reproduce the task,' yet it is unclear whether extraction success is defined strictly as recovering the exact original prompt string or as enabling an attacker to reproduce the original task behavior or sensitive criteria via follow-up queries to the proxy. This distinction is load-bearing for the security claim because functional equivalence is required for utility but is in direct tension with preventing reproduction.
- [Evaluation section] The manuscript lacks sufficient detail on attack implementations, baseline constructions, and statistical controls (e.g., variance across runs or significance testing), which prevents verification of the reported performance gap over the 42.80% baseline.
minor comments (2)
- [Method section] Clarify the exact procedure for constructing the proxy prompt and any hyperparameters involved, as these appear central to reproducibility.
- [Results] Ensure tables reporting the 264-pair results include per-category breakdowns or confidence intervals to strengthen the aggregate 94.70% claim.
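The request for confidence intervals on the aggregate figures can be made concrete with a short calculation. The sketch below computes a Wilson score interval for each reported protection rate. The counts 250 and 113 are an assumption: they are inferred from the reported percentages over 264 pairs (250/264 rounds to 94.70%, 113/264 rounds to 42.80%) and are not stated in the source. The function name `wilson_interval` is illustrative.

```python
import math


def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


TOTAL_PAIRS = 264
# Assumption: counts back-computed from the reported percentages,
# not taken from the paper's tables.
PROXYPROMPT_PROTECTED = 250  # 250 / 264 rounds to 94.70%
NEXT_BEST_PROTECTED = 113    # 113 / 264 rounds to 42.80%

proxy_ci = wilson_interval(PROXYPROMPT_PROTECTED, TOTAL_PAIRS)
baseline_ci = wilson_interval(NEXT_BEST_PROTECTED, TOTAL_PAIRS)

print(f"ProxyPrompt: {proxy_ci[0]:.3f} to {proxy_ci[1]:.3f}")
print(f"Next-best defense: {baseline_ci[0]:.3f} to {baseline_ci[1]:.3f}")
```

An interval of this kind treats the 264 pairs as independent trials, which may not hold if several prompts share a model or a task family; per-category breakdowns would expose that.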
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify key aspects of our evaluation protocol. We address each major comment below and have revised the manuscript to incorporate the requested clarifications and details.
Point-by-point responses
- Referee: [Abstract and Evaluation section] The 94.70% protection figure is presented as evidence that attackers 'cannot reproduce the task,' yet it is unclear whether extraction success is defined strictly as recovering the exact original prompt string or as enabling an attacker to reproduce the original task behavior or sensitive criteria via follow-up queries to the proxy. This distinction is load-bearing for the security claim because functional equivalence is required for utility but is in direct tension with preventing reproduction.
  Authors: We appreciate the referee pointing out this critical distinction in our threat model. In ProxyPrompt, extraction success is defined as the attacker being able to reproduce the original task behavior or access sensitive criteria (e.g., filtering rules) via follow-up queries when using the extracted proxy prompt. This is distinct from exact string recovery of the original prompt. The proxy is designed to preserve utility for legitimate use while obfuscating the prompt so that extracted versions do not enable an adversary to functionally reproduce the task or its sensitive behaviors. We have revised the Abstract and Evaluation sections to state this definition of extraction success explicitly and to explain how the 94.70% protection rate was measured under this criterion, resolving the noted tension.
  Revision: yes
- Referee: [Evaluation section] The manuscript lacks sufficient detail on attack implementations, baseline constructions, and statistical controls (e.g., variance across runs or significance testing), which prevents verification of the reported performance gap over the 42.80% baseline.
  Authors: We agree that greater detail is required for reproducibility and to substantiate the performance claims. In the revised manuscript, we have expanded the Evaluation section with: (i) full specifications of the attack implementations, including the query templates and prompting strategies employed; (ii) precise descriptions of how each baseline defense was constructed and parameterized; and (iii) statistical controls, including standard deviations across repeated runs and results of significance testing (paired t-tests) confirming that the gap versus the 42.80% baseline is statistically significant. These additions enable independent verification of the results.
  Revision: yes
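The paired t-test mentioned in the rebuttal can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' analysis: the function name `paired_t_statistic` is hypothetical, and the per-run protection rates are made-up placeholder numbers chosen only to show the calculation. They are not results from the paper.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(a: list[float], b: list[float]) -> float:
    """t statistic for paired samples: mean difference over its standard error."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    return mean(diffs) / (sd / math.sqrt(len(diffs)))


# Placeholder per-run protection rates (hypothetical, NOT from the paper).
proxy_runs = [0.95, 0.94, 0.96, 0.95]
baseline_runs = [0.43, 0.41, 0.44, 0.42]

t_stat = paired_t_statistic(proxy_runs, baseline_runs)
degrees_of_freedom = len(proxy_runs) - 1
print(f"t = {t_stat:.1f} with {degrees_of_freedom} degrees of freedom")
```

The statistic is compared against a t distribution with n - 1 degrees of freedom to obtain a p-value; `scipy.stats.ttest_rel` performs both steps if SciPy is available. If the paired observations are per-prompt binary outcomes (protected or not) rather than per-run rates, a test designed for paired binary data, such as McNemar's test, may be a better fit.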
Circularity Check
No significant circularity detected
The paper introduces ProxyPrompt as an empirical defense that replaces system prompts with proxies and validates the approach through direct experiments on 264 LLM-prompt pairs, reporting a 94.70% protection rate. No equations, parameter fits presented as predictions, self-referential definitions, or load-bearing self-citation chains appear in the provided text. The central security claim rests on reported experimental outcomes against extraction attacks rather than any derivation that reduces to its own inputs by construction, making the work self-contained against the stated benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLM behavior on a task remains functionally equivalent when the system prompt is replaced by a suitably designed proxy.
invented entities (1)
- Proxy prompt: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear (relation between the paper passage and the cited Recognition theorem)
  Paper passage: $\arg\min_{\tilde{\phi}_P} \left[ \frac{1}{|Q|} \sum \mathcal{L}\big(f_{\phi_P}(\phi_Q),\, f_{\tilde{\phi}_P}(\phi_Q)\big) + \mathcal{L}\big(f_{\tilde{\phi}_P \| \phi_{P'}}(\phi_{Q'}),\, \tilde{P}\big) \right]$ (Eq. 3); continuous-to-discrete gap quantified by cosine similarity to nearest vocabulary tokens
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear (relation between the paper passage and the cited Recognition theorem)
  Paper passage: Semantic-Match (SM) and Most-Similar (MS) metrics using entailment and sentence embeddings to detect rephrased leaks
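The first passage above refers to a continuous-to-discrete gap measured by cosine similarity to the nearest vocabulary tokens. A minimal sketch of that kind of measurement follows, assuming the proxy is a vector in the same space as the token embeddings. The names `cosine_similarity`, `nearest_token`, `TOY_VOCAB`, and `proxy_vector` are hypothetical, and the three-dimensional vectors are toy values, not embeddings from any real model.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two equally long, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def nearest_token(embedding: list[float], vocab: dict[str, list[float]]) -> tuple[str, float]:
    """Return the vocabulary token most similar to `embedding`, with its similarity."""
    scored = ((token, cosine_similarity(embedding, vec)) for token, vec in vocab.items())
    return max(scored, key=lambda pair: pair[1])


# Toy vocabulary and proxy vector (hypothetical values for illustration only).
TOY_VOCAB = {
    "alpha": [1.0, 0.0, 0.0],
    "beta": [0.0, 1.0, 0.0],
    "gamma": [0.0, 0.0, 1.0],
}
proxy_vector = [0.6, 0.5, 0.1]

token, similarity = nearest_token(proxy_vector, TOY_VOCAB)
print(token, round(similarity, 3))
```

A similarity well below 1 for the best match would indicate that the continuous vector does not sit close to any single discrete token, which is one way to read the gap the passage describes.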
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.