Bypassing Prompt Guards in Production with Controlled-Release Prompting

Jaiden Fairoze; Keewoo Lee; Mingyuan Wang; Sanjam Garg

arXiv: 2510.01529 · v3 · pith:C4JKRYZ4new · submitted 2025-10-02 · cs.LG · cs.CR

keywords: attack, chat, model, prompt, controlled-release, filter, filters, gemini

Ball et al. recently established that prompt filtering for AI alignment faces a fundamental barrier: under standard cryptographic assumptions, no filter running significantly faster than the protected model can universally distinguish adversarial prompts from benign ones. We investigate whether this impossibility result translates to real-world vulnerabilities in deployed large language model (LLM) systems. We answer affirmatively by introducing controlled-release prompting, a practical instantiation of the theoretical framework that exploits the resource asymmetry between lightweight input filters and the main models they protect. Unlike the theoretical construction, our attack does not require model modification: it generates malicious prompts that are indecipherable by any bounded filter yet remain tractable to the target LLM. We find our attack to be successful on four major chat platforms (Google Gemini, DeepSeek Chat, xAI Grok, and Mistral Le Chat) where baseline methods fail. Additionally, we apply our attack to extract copyrighted data from Gemini. Finally, we provide a systematic evaluation of 14 open-weight prompt guard models, revealing that even reasoning-capable filters cannot reliably detect our attack without incurring prohibitive resource overhead.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Exploring and Developing a Pre-Model Safeguard with Draft Models
cs.CR 2026-05 unverdicted novelty 6.0

A safeguard that uses speculative inference on small language models to produce draft responses for safety prediction, lowering false negatives in pre-model jailbreak detection.