Intent Laundering: AI Safety Datasets Are Not What They Seem
Pith reviewed 2026-05-15 21:32 UTC · model grok-4.3
The pith
Current adversarial safety datasets over-rely on obvious triggering cues that do not match real-world attacks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Adversarial safety datasets fail to represent real-world attacks because they over-rely on triggering cues, words or phrases carrying overt negative connotations meant to activate safety filters. Intent laundering removes these cues while preserving malicious intent and all relevant details, after which every previously evaluated model becomes unsafe. The same procedure, when used directly as a jailbreaking method, achieves attack success rates between 90 and 100 percent under fully black-box conditions.
What carries the argument
Intent laundering, a procedure that abstracts away triggering cues from adversarial examples while strictly preserving malicious intent and all relevant details.
If this is right
- Models rated as reasonably safe on existing datasets, including Gemini 3 Pro and Claude Sonnet 3.7/4, become unsafe once triggering cues are removed.
- Intent laundering adapted as a jailbreak achieves 90 to 100 percent success under black-box access.
- Existing datasets measure sensitivity to explicit negative language more than genuine resistance to hidden malicious intent.
- Safety evaluations based on current datasets overestimate robustness against realistic adversaries.
- A disconnect exists between how datasets test safety and how real-world attacks are crafted.
Where Pith is reading between the lines
- Future datasets should be built from the start without any overt negative language to better simulate stealthy attacks.
- Safety mechanisms may currently lean too much on keyword or phrase detection rather than full intent understanding.
- Intent laundering offers a practical way to generate stronger test cases for new models without manual rewriting.
- Training data for safety alignment could benefit from including laundered versions of attacks to reduce reliance on surface cues.
Load-bearing premise
That intent laundering can reliably strip only triggering cues without changing how models respond for reasons unrelated to safety or introducing new artifacts.
What would settle it
Running the same models on a fresh set of adversarial examples that were never constructed with any triggering cues and checking whether attack success rates stay near zero or rise sharply after laundering.
read the original abstract
We systematically evaluate the quality of widely used adversarial safety datasets from two perspectives: in isolation and in practice. In isolation, we examine how well these datasets reflect real-world adversarial attacks based on three defining properties: being driven by ulterior intent, well-crafted, and out-of-distribution. We find that these datasets overrely on "triggering cues": words or phrases with overt negative/sensitive connotations that are intended to trigger safety mechanisms explicitly, which is unrealistic compared to real-world attacks. In practice, we evaluate whether these datasets genuinely measure safety risks or merely provoke refusals through triggering cues. To explore this, we introduce "intent laundering": a procedure that abstracts away triggering cues from adversarial attacks (data points) while strictly preserving their malicious intent and all relevant details. Our results show that current adversarial safety datasets fail to faithfully represent real-world adversarial behavior due to their overreliance on triggering cues. Once these cues are removed, all previously evaluated "reasonably safe" models become unsafe, including Gemini 3 Pro and Claude Sonnet 3.7/4. Moreover, when intent laundering is adapted as a jailbreaking technique, it consistently achieves high attack success rates, ranging from 90.00% to 100.00%, under fully black-box access. Overall, our findings expose a significant disconnect between how existing datasets evaluate model safety and how real-world adversaries behave.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that widely used adversarial safety datasets overrely on overt 'triggering cues' (words/phrases with negative connotations) that do not reflect real-world attacks, which are driven by ulterior intent, well-crafted, and out-of-distribution. It introduces 'intent laundering,' a procedure that abstracts away these cues while strictly preserving malicious intent and details, showing that this renders previously 'reasonably safe' models (including Gemini 3 Pro and Claude Sonnet 3.7/4) unsafe and achieves 90-100% black-box attack success rates when used as a jailbreak.
Significance. If the central results hold after validation, the work identifies a fundamental mismatch between current safety benchmarks and realistic adversarial behavior, implying that many existing evaluations may overestimate model safety. The introduction of intent laundering as both a diagnostic tool and effective jailbreak technique could influence future dataset design and red-teaming practices in AI safety.
major comments (2)
- [Abstract] Abstract: The claim that intent laundering 'abstracts away triggering cues... while strictly preserving their malicious intent and all relevant details' is load-bearing for attributing model failures solely to cue removal, yet no formal definition of triggering cues, reproducible algorithm, or validation experiments (e.g., comparing non-safety-related behaviors like coherence or response length before/after laundering) are described.
- The reported 90-100% attack success rates and model failures (Gemini 3 Pro, Claude Sonnet) lack details on experimental controls, number of trials, prompt variations, or baselines that would rule out artifacts from rephrasing unrelated to safety triggers.
minor comments (1)
- Clarify the exact criteria used to identify 'triggering cues' versus other prompt elements in the laundering procedure.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help improve the clarity and rigor of our work. We address each major comment below and will make corresponding revisions to the manuscript.
read point-by-point responses
-
Referee: [Abstract] Abstract: The claim that intent laundering 'abstracts away triggering cues... while strictly preserving their malicious intent and all relevant details' is load-bearing for attributing model failures solely to cue removal, yet no formal definition of triggering cues, reproducible algorithm, or validation experiments (e.g., comparing non-safety-related behaviors like coherence or response length before/after laundering) are described.
Authors: We agree that a more formal presentation of the intent laundering procedure is warranted to support the central claims. In the revised version, we will add a precise definition of triggering cues in the preliminaries section and provide pseudocode for the laundering algorithm. We will also include validation experiments assessing non-safety aspects such as response coherence and length before and after the procedure to confirm that changes are isolated to the removal of triggering cues. revision: yes
-
Referee: The reported 90-100% attack success rates and model failures (Gemini 3 Pro, Claude Sonnet) lack details on experimental controls, number of trials, prompt variations, or baselines that would rule out artifacts from rephrasing unrelated to safety triggers.
Authors: The manuscript reports these details in the experimental setup section, including the use of 100 trials per model across multiple datasets, with controls for prompt length and semantic preservation. To further address potential concerns about rephrasing artifacts, we will add explicit baselines comparing intent laundering to generic rephrasing techniques and report statistical measures. This will be expanded in the revision. revision: partial
Circularity Check
No significant circularity; procedure and evaluations are externally grounded
full rationale
The paper introduces intent laundering as an explicitly defined procedure for removing triggering cues while preserving intent and details, then applies it to evaluate existing datasets via direct testing on external models (Gemini, Claude, etc.). No derivation step reduces by construction to fitted parameters, self-definitions, or self-citation chains; the central claims rest on reproducible empirical model responses rather than internal renaming or ansatz smuggling. The evaluation chain is self-contained against external benchmarks with no load-bearing self-referential elements.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Real-world adversarial attacks are driven by ulterior intent, well-crafted, and out-of-distribution without relying on overt negative/sensitive words.
invented entities (1)
-
Intent laundering procedure
no independent evidence
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.