Red-Teaming Text-to-Image Models via In-Context Experience Replay and Semantic-Preserving Prompt Rewriting

Mario Fritz; Pin-Yu Chen; Wei-Chen Chiu; Zhi-Yi Chin

arXiv: 2411.16769 · v3 · submitted 2024-11-25 · cs.LG · cs.CL · cs.CR · cs.CV

Zhi-Yi Chin, Pin-Yu Chen, Wei-Chen Chiu, Mario Fritz

Pith reviewed 2026-05-23 08:03 UTC · model grok-4.3

classification: cs.LG · cs.CL · cs.CR · cs.CV

keywords: red-teaming, text-to-image models, adversarial prompts, jailbreaking, prompt rewriting, experience replay, black-box attack, safety evaluation

The pith

A black-box framework uses LLM rewriting and experience replay to generate fluent adversarial prompts that evade text-to-image safety filters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ICER as a way to automatically simulate misuse of text-to-image models without white-box access or unnatural token strings. An LLM rewriter turns harmful intents into natural-sounding prompts, while in-context experience replay stores patterns from past successes for reuse. Bandit optimization decides whether to exploit those patterns or explore new ones. Experiments show the method beats seven baselines across six safety mechanisms under both standard and semantics-preserving evaluation, and over 30 percent of its generated prompts transfer to commercial systems.

Core claim

ICER is a black-box framework that addresses the gap in fluent adversarial prompt generation through an LLM-based rewriter that produces natural-language prompts preserving harmful intent and in-context experience replay that accumulates successful jailbreaking patterns into a reusable prior, with the two components integrated via bandit optimization to balance exploitation and exploration.

What carries the argument

In-context experience replay, which accumulates successful jailbreaking patterns into a reusable prior and is combined with bandit optimization to select between known and new attack strategies.
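The exploit-versus-explore choice can be illustrated with a generic bandit loop. The abstract does not say which bandit algorithm ICER uses, so the sketch below substitutes standard UCB1 as an assumption. The arms are abstract strategy slots, the success probabilities in `true_rates` are made up, and rewards are simulated coin flips rather than calls to any model or filter. The names `ucb1_select` and `run_bandit` are hypothetical.

```python
import math
import random


def ucb1_select(counts, successes, t, c=2.0):
    """Return the index of the arm with the highest UCB1 score at round t."""
    # Pull every arm once before scores can be compared.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [
        # Empirical success rate (exploit) plus an uncertainty bonus (explore).
        successes[a] / counts[a] + math.sqrt(c * math.log(t) / counts[a])
        for a in range(len(counts))
    ]
    return max(range(len(counts)), key=scores.__getitem__)


def run_bandit(true_rates, rounds=2000, seed=0):
    """Simulate UCB1 on Bernoulli arms; true_rates is hypothetical toy data."""
    rng = random.Random(seed)
    counts = [0] * len(true_rates)
    successes = [0] * len(true_rates)
    for t in range(1, rounds + 1):
        arm = ucb1_select(counts, successes, t)
        reward = 1 if rng.random() < true_rates[arm] else 0  # simulated outcome
        counts[arm] += 1
        successes[arm] += reward
    return counts, successes


# Three abstract strategy slots with assumed success probabilities.
counts, successes = run_bandit([0.1, 0.6, 0.3])
```

In this toy setting most pulls end up on the arm with the highest assumed success rate, while the bonus term keeps the others from being abandoned entirely. That is the trade-off the summary attributes to the bandit component.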

If this is right

ICER outperforms seven baselines across six safety mechanisms in both standard and semantics-preserving evaluations.
Over 30 percent of the generated prompts transfer to commercial systems such as DALL-E 3 and Midjourney.
The accumulated prior from successful attacks becomes available for future red-teaming without restarting from scratch.
Bandit-driven selection allows efficient trade-off between reusing proven patterns and trying new prompt variants.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same replay-plus-rewriter structure could be tested on other generative domains such as text or audio models.
The stored patterns might serve as training data for improving future safety filters rather than only for attacks.
Dependence on a separate LLM for rewriting introduces a new point of failure if that LLM itself refuses harmful rewrites.

Load-bearing premise

An LLM rewriter can reliably turn harmful intents into fluent natural-language prompts that keep the original meaning while evading safety mechanisms.

What would settle it

A controlled test in which the LLM rewriter produces prompts that either drop the harmful intent or fail to evade filters on a held-out safety mechanism after the bandit has run for a fixed number of trials.
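One way to score such a test is to record two outcomes per trial and report both rates side by side. The sketch below is a minimal illustration, not the paper's evaluation protocol: the `Trial` fields, the 0 to 1 `intent_similarity` score, and the 0.8 threshold are all assumptions introduced here, and the four trials are invented.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    evaded_filter: bool       # did the held-out safety mechanism pass the prompt?
    intent_similarity: float  # hypothetical 0..1 score from an intent-preservation judge


def success_rates(trials, sim_threshold=0.8):
    """Return (standard rate, semantics-preserving rate) over the trials."""
    n = len(trials)
    if n == 0:
        raise ValueError("need at least one trial")
    # Standard criterion: evasion alone counts as a success.
    standard = sum(t.evaded_filter for t in trials) / n
    # Stricter criterion: evasion counts only if the intent score clears the threshold.
    preserving = sum(
        t.evaded_filter and t.intent_similarity >= sim_threshold for t in trials
    ) / n
    return standard, preserving


# Invented toy data: three evasions, one of which drifted from the original intent.
trials = [Trial(True, 0.9), Trial(True, 0.4), Trial(False, 0.95), Trial(True, 0.85)]
standard, preserving = success_rates(trials)  # 0.75 and 0.5
```

A large gap between the two numbers would indicate that the rewriter evades filters mainly by dropping the harmful intent, which is the failure mode this test is meant to expose.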

Original abstract

Understanding the capabilities of text-to-image (T2I) models in harmful content generation is essential to safety and compliance. However, human red-teaming is costly and inconsistent, driving the need for automatic tools that simulate realistic misuse attempts. Existing methods either require white-box access, fail to generalize across defenses, or produce uninterpretable adversarial tokens, while generating fluent prompts that preserve the original harmful intent remains underexplored despite its practical relevance. We propose ICER, a black-box framework that addresses this gap through two components: an LLM-based rewriter that produces fluent, natural-language adversarial prompts, and in-context experience replay that accumulates successful jailbreaking patterns into a reusable prior. These components are integrated via bandit optimization, enabling ICER to efficiently balance exploiting proven attack strategies with exploring new ones. Experiments across six safety mechanisms show that ICER outperforms seven baselines under both standard and semantics-preserving evaluation, with over 30% of generated prompts transferring to commercial systems like DALL-E 3 and Midjourney.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

ICER is a black-box red-teaming method for T2I models that combines LLM prompt rewriting with experience replay and bandits, but the abstract gives no experimental details to judge whether the claimed gains hold.

The paper's main contribution is ICER, a framework that rewrites harmful prompts with an LLM to keep them fluent and semantically close to the original intent, then reuses successful attacks via in-context replay and tunes the mix of old and new strategies with bandit optimization. This targets the practical gap where most prior attacks either need white-box access or spit out unreadable token strings. The abstract positions the fluent-prompt focus as underexplored and reports that the method beats seven baselines across six safety mechanisms while transferring over 30 percent of prompts to DALL-E 3 and Midjourney. That combination of components is presented as new for this setting, and the emphasis on realistic, human-like prompts is a reasonable practical angle if the results back it up. The work is clearly aimed at people building or auditing deployed T2I systems who need scalable ways to test defenses without manual effort.

The soft spot is that everything rests on the abstract. There are no descriptions of the baselines, no implementation details on the rewriter or replay buffer, no statistical tests, and no tables. Without those, the outperformance claim cannot be checked for robustness or for whether the semantics-preserving evaluation actually measures what it claims. The transfer numbers are the most concrete result mentioned, but they still lack context on prompt volume or variance.

This is the kind of paper that deserves a serious referee once the full methods and results are available, because the problem is real and the proposed direction is coherent on paper. I would bring it to a reading group only after seeing the experiments, and I would not cite it yet. Send it for review rather than desk-reject so the details can be evaluated properly.

Referee Report

1 major / 0 minor

Summary. The paper proposes ICER, a black-box framework for red-teaming text-to-image models. It consists of an LLM-based rewriter that generates fluent, natural-language adversarial prompts preserving harmful intent and an in-context experience replay mechanism that accumulates successful jailbreaking patterns, integrated via bandit optimization to balance exploitation and exploration. The central claim is that experiments across six safety mechanisms show ICER outperforming seven baselines under both standard and semantics-preserving evaluations, with over 30% of generated prompts transferring to commercial systems such as DALL-E 3 and Midjourney.

Significance. If the experimental results hold with appropriate controls and statistical support, the work would address a practical gap in automated, interpretable red-teaming tools for T2I models by emphasizing fluent prompt generation and black-box transferability, which could aid safety evaluations and compliance efforts in generative AI.

major comments (1)

[Abstract] Abstract: The manuscript reports outperformance across six mechanisms and seven baselines plus transfer rates exceeding 30% but supplies no experimental details, baseline descriptions, dataset information, statistical analysis, error bars, or result tables. Full methods and results sections are required to evaluate support for the central empirical claims.
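To see why prompt volume matters for a figure like "over 30%", consider the uncertainty around a single proportion. The sketch below computes a Wilson score interval, a standard choice for binomial rates. The counts (30 transfers out of 100 prompts) are purely hypothetical, since the abstract reports no sample sizes, and the function name `wilson_interval` is introduced here for illustration.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Hypothetical counts, not taken from the paper.
low, high = wilson_interval(30, 100)
```

With these assumed counts the interval spans roughly 0.22 to 0.40, so a 30 percent point estimate from 100 prompts would be compatible with a true rate well below 30 percent. Reporting the number of prompts alongside the rate would let readers run this check themselves.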

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for their review. We address the single major comment below.

Referee: [Abstract] Abstract: The manuscript reports outperformance across six mechanisms and seven baselines plus transfer rates exceeding 30% but supplies no experimental details, baseline descriptions, dataset information, statistical analysis, error bars, or result tables. Full methods and results sections are required to evaluate support for the central empirical claims.

Authors: The text provided for review consists solely of the abstract, which is a concise summary of the work. As a result, the full methods, baseline descriptions, dataset information, statistical analysis, error bars, and result tables are not available in the supplied manuscript. The abstract follows standard length constraints and reports high-level outcomes; the complete experimental details would appear in dedicated sections of the full paper.

revision: no

standing simulated objections not resolved

Full experimental details, baseline descriptions, dataset information, statistical analysis, error bars, and result tables (only the abstract is available)

Circularity Check

0 steps flagged

No significant circularity; purely empirical method

The provided abstract and full-text note contain no equations, derivations, fitted parameters, or first-principles claims. ICER is described as a black-box empirical framework combining an LLM rewriter and in-context replay via bandit optimization, evaluated experimentally against baselines. No load-bearing step reduces by construction to its inputs, self-citation chains, or renamed known results; the reader's 0.0 assessment is confirmed by the complete absence of any derivational content that could be inspected for circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no information on free parameters, axioms, or invented entities.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SafeDiffusion-R1: Online Reward Steering for Safe Diffusion Post-Training
cs.CV 2026-05 unverdicted novelty 6.0

SafeDiffusion-R1 uses online GRPO with CLIP embedding steering to cut inappropriate content from 48.9% to 18.07% and nudity detections from 646 to 15 in diffusion models while raising GenEval scores from 42.08% to 47....