Red-Teaming Text-to-Image Models via In-Context Experience Replay and Semantic-Preserving Prompt Rewriting
Pith reviewed 2026-05-23 08:03 UTC · model grok-4.3
The pith
A black-box framework uses LLM rewriting and experience replay to generate fluent adversarial prompts that evade text-to-image safety filters.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ICER is a black-box framework for generating fluent adversarial prompts, a gap left by prior methods. It combines an LLM-based rewriter, which produces natural-language prompts that preserve the harmful intent, with in-context experience replay, which accumulates successful jailbreaking patterns into a reusable prior. Bandit optimization integrates the two components to balance exploitation and exploration.
What carries the argument
In-context experience replay, which accumulates successful jailbreaking patterns into a reusable prior and is combined with bandit optimization to select between known and new attack strategies.
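The abstract does not say which bandit algorithm ICER uses, so the following is only a generic illustration of the exploitation/exploration trade-off, using textbook UCB1. The arms are opaque indices with simulated Bernoulli rewards. The names `UCB1`, `simulate`, and `success_probs` are hypothetical and do not come from the paper.

```python
import math
import random


class UCB1:
    """Textbook UCB1 bandit over a fixed set of arms (illustrative only)."""

    def __init__(self, n_arms: int) -> None:
        self.counts = [0] * n_arms   # number of pulls per arm
        self.means = [0.0] * n_arms  # running mean reward per arm

    def select(self) -> int:
        # Pull every arm once before relying on the confidence bound.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        total = sum(self.counts)
        # Score = observed mean (exploitation) + uncertainty bonus (exploration).
        scores = [
            mean + math.sqrt(2.0 * math.log(total) / count)
            for mean, count in zip(self.means, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm: int, reward: float) -> None:
        self.counts[arm] += 1
        # Incremental update of the running mean.
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


def simulate(success_probs, n_rounds=5000, seed=0):
    """Run UCB1 against simulated Bernoulli arms (assumed, not real data)."""
    rng = random.Random(seed)
    bandit = UCB1(len(success_probs))
    for _ in range(n_rounds):
        arm = bandit.select()
        reward = 1.0 if rng.random() < success_probs[arm] else 0.0
        bandit.update(arm, reward)
    return bandit


if __name__ == "__main__":
    result = simulate([0.1, 0.3, 0.6])
    print("pulls per arm:", result.counts)
    print("estimated means:", [round(m, 3) for m in result.means])
```

In this sketch the uncertainty bonus shrinks as an arm is pulled more often, so the selector gradually concentrates on the arm with the highest observed reward while still sampling the others occasionally.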
If this is right
- ICER outperforms seven baselines across six safety mechanisms in both standard and semantics-preserving evaluations.
- Over 30 percent of the generated prompts transfer to commercial systems such as DALL-E 3 and Midjourney.
- The accumulated prior from successful attacks becomes available for future red-teaming without restarting from scratch.
- Bandit-driven selection allows efficient trade-off between reusing proven patterns and trying new prompt variants.
Where Pith is reading between the lines
- The same replay-plus-rewriter structure could be tested on other generative domains such as text or audio models.
- The stored patterns might serve as training data for improving future safety filters rather than only for attacks.
- Dependence on a separate LLM for rewriting introduces a new point of failure if that LLM itself refuses harmful rewrites.
Load-bearing premise
An LLM rewriter can reliably turn harmful intents into fluent natural-language prompts that keep the original meaning while evading safety mechanisms.
What would settle it
A controlled test in which the LLM rewriter produces prompts that either drop the harmful intent or fail to evade filters on a held-out safety mechanism after the bandit has run for a fixed number of trials.
Original abstract
Understanding the capabilities of text-to-image (T2I) models in harmful content generation is essential to safety and compliance. However, human red-teaming is costly and inconsistent, driving the need for automatic tools that simulate realistic misuse attempts. Existing methods either require white-box access, fail to generalize across defenses, or produce uninterpretable adversarial tokens, while generating fluent prompts that preserve the original harmful intent remains underexplored despite its practical relevance. We propose ICER, a black-box framework that addresses this gap through two components: an LLM-based rewriter that produces fluent, natural-language adversarial prompts, and in-context experience replay that accumulates successful jailbreaking patterns into a reusable prior. These components are integrated via bandit optimization, enabling ICER to efficiently balance exploiting proven attack strategies with exploring new ones. Experiments across six safety mechanisms show that ICER outperforms seven baselines under both standard and semantics-preserving evaluation, with over 30% of generated prompts transferring to commercial systems like DALL-E 3 and Midjourney.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes ICER, a black-box framework for red-teaming text-to-image models. It consists of an LLM-based rewriter that generates fluent, natural-language adversarial prompts preserving harmful intent and an in-context experience replay mechanism that accumulates successful jailbreaking patterns, integrated via bandit optimization to balance exploitation and exploration. The central claim is that experiments across six safety mechanisms show ICER outperforming seven baselines under both standard and semantics-preserving evaluations, with over 30% of generated prompts transferring to commercial systems such as DALL-E 3 and Midjourney.
Significance. If the experimental results hold with appropriate controls and statistical support, the work would address a practical gap in automated, interpretable red-teaming tools for T2I models by emphasizing fluent prompt generation and black-box transferability, which could aid safety evaluations and compliance efforts in generative AI.
Major comments (1)
- [Abstract] Abstract: The manuscript reports outperformance across six mechanisms and seven baselines plus transfer rates exceeding 30% but supplies no experimental details, baseline descriptions, dataset information, statistical analysis, error bars, or result tables. Full methods and results sections are required to evaluate support for the central empirical claims.
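To make the request for error bars concrete, here is a minimal sketch of a Wilson score interval for a reported success rate. The abstract gives no sample sizes, so the counts in the usage line (30 successes out of 100 trials) are hypothetical and chosen only to match the "over 30%" figure in spirit. The function name `wilson_interval` is likewise an illustrative choice.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to an approximate 95% confidence level.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    p_hat = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = (p_hat + z2 / (2.0 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1.0 - p_hat) / trials + z2 / (4.0 * trials * trials))
    ) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Hypothetical counts; the paper's actual sample sizes are not given here.
    low, high = wilson_interval(30, 100)
    print(f"95% interval: [{low:.3f}, {high:.3f}]")
```

With a sample this small the interval is wide, which is why the sample size behind each reported rate matters for judging the empirical claims.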
Simulated Author's Rebuttal
We thank the referee for their review. We address the single major comment below.
- Referee: [Abstract] Abstract: The manuscript reports outperformance across six mechanisms and seven baselines plus transfer rates exceeding 30% but supplies no experimental details, baseline descriptions, dataset information, statistical analysis, error bars, or result tables. Full methods and results sections are required to evaluate support for the central empirical claims.
  Authors: The text provided for review consists solely of the abstract, which is a concise summary of the work. As a result, the full methods, baseline descriptions, dataset information, statistical analysis, error bars, and result tables are not available in the supplied manuscript. The abstract follows standard length constraints and reports high-level outcomes; the complete experimental details would appear in dedicated sections of the full paper.
  Revision: no
- Full experimental details, baseline descriptions, dataset information, statistical analysis, error bars, and result tables (only the abstract is available)
Circularity Check
No significant circularity; purely empirical method
The provided abstract and full-text note contain no equations, derivations, fitted parameters, or first-principles claims. ICER is described as a black-box empirical framework combining an LLM rewriter and in-context replay via bandit optimization, evaluated experimentally against baselines. No load-bearing step reduces by construction to its inputs, self-citation chains, or renamed known results; the reader's 0.0 assessment is confirmed by the complete absence of any derivational content that could be inspected for circularity.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 1 Pith paper
- SafeDiffusion-R1: Online Reward Steering for Safe Diffusion Post-Training
  SafeDiffusion-R1 uses online GRPO with CLIP embedding steering to cut inappropriate content from 48.9% to 18.07% and nudity detections from 646 to 15 in diffusion models while raising GenEval scores from 42.08% to 47....