Intent Laundering: AI Safety Datasets Are Not What They Seem

Marc Wetter; Shahriar Golchin

arXiv: 2602.16729 · v3 · submitted 2026-02-17 · cs.CR · cs.AI · cs.CL · cs.LG

Pith reviewed 2026-05-15 21:32 UTC · model grok-4.3

classification: cs.CR · cs.AI · cs.CL · cs.LG

keywords: AI safety · adversarial datasets · jailbreaking · triggering cues · intent laundering · model evaluation · black-box attacks · refusal mechanisms

The pith

Current adversarial safety datasets over-rely on obvious triggering cues that do not match real-world attacks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that popular safety datasets for testing AI models depend heavily on words or phrases with clear negative meanings that explicitly prompt refusals. These cues make the datasets unrealistic compared to actual adversarial attempts, which avoid such signals to hide intent. The authors introduce intent laundering, a method that strips out these cues while keeping the original harmful goal and details intact. When applied to existing datasets, this change causes all tested models previously rated as reasonably safe to fail, including advanced versions of Gemini and Claude. The same technique also works as a black-box jailbreak with very high success rates.

Core claim

Adversarial safety datasets fail to represent real-world attacks because they over-rely on triggering cues: words or phrases carrying overt negative connotations meant to activate safety filters. Intent laundering removes these cues while preserving malicious intent and all relevant details, after which every model previously rated as reasonably safe becomes unsafe. The same procedure, when used directly as a jailbreaking method, achieves attack success rates between 90 and 100 percent under fully black-box conditions.
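
To make the headline metric concrete, attack success rate can be read as the fraction of attempts that a judge marks as successful. The following is a minimal sketch under that assumption. The `Trial` record, the "original" and "laundered" labels, the model name, and the toy outcomes are all hypothetical and are not taken from the paper, which may score success differently.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    model: str      # hypothetical model identifier
    variant: str    # "original" or "laundered" (labels assumed for illustration)
    complied: bool  # True if a judge marked the response as a successful attack


def attack_success_rate(trials, model, variant):
    """Fraction of trials for (model, variant) that were judged successful."""
    relevant = [t for t in trials if t.model == model and t.variant == variant]
    if not relevant:
        raise ValueError(f"no trials recorded for {model!r} / {variant!r}")
    return sum(t.complied for t in relevant) / len(relevant)


# Toy outcomes, invented for illustration only.
trials = [
    Trial("model-a", "original", False),
    Trial("model-a", "original", False),
    Trial("model-a", "original", True),
    Trial("model-a", "original", False),
    Trial("model-a", "laundered", True),
    Trial("model-a", "laundered", True),
    Trial("model-a", "laundered", True),
    Trial("model-a", "laundered", False),
]

print(attack_success_rate(trials, "model-a", "original"))   # 0.25
print(attack_success_rate(trials, "model-a", "laundered"))  # 0.75
```

Keeping the variant label on each record makes the before/after comparison explicit, which is the comparison the core claim rests on.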

What carries the argument

Intent laundering, a procedure that abstracts away triggering cues from adversarial examples while strictly preserving malicious intent and all relevant details.

If this is right

Models rated as reasonably safe on existing datasets, including Gemini 3 Pro and Claude Sonnet 3.7/4, become unsafe once triggering cues are removed.
Intent laundering adapted as a jailbreak achieves 90 to 100 percent success under black-box access.
Existing datasets measure sensitivity to explicit negative language more than genuine resistance to hidden malicious intent.
Safety evaluations based on current datasets overestimate robustness against realistic adversaries.
A disconnect exists between how datasets test safety and how real-world attacks are crafted.
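
One way to probe the claim that datasets lean on explicit negative language is to count how many prompts contain at least one term from a cue lexicon. The sketch below assumes a tiny hand-picked lexicon and a toy prompt list. Neither comes from the paper, whose criteria for identifying triggering cues are not described in this summary, so `CUE_LEXICON`, `cue_hits`, and `cue_rate` are illustrative names only.

```python
import re

# Hypothetical lexicon, chosen only to illustrate the idea of an overt cue.
CUE_LEXICON = {"illegal", "harmful", "malicious", "dangerous"}


def cue_hits(prompt, lexicon=CUE_LEXICON):
    """Return the lexicon terms that appear in the prompt, in order of appearance."""
    tokens = re.findall(r"[a-z']+", prompt.lower())
    return [tok for tok in tokens if tok in lexicon]


def cue_rate(prompts, lexicon=CUE_LEXICON):
    """Fraction of prompts containing at least one lexicon term."""
    if not prompts:
        raise ValueError("prompts must be non-empty")
    flagged = sum(1 for p in prompts if cue_hits(p, lexicon))
    return flagged / len(prompts)


# Toy prompts, invented for illustration.
sample = [
    "Explain how to do something illegal and dangerous.",
    "Describe a harmful procedure in detail.",
    "Summarize the history of the printing press.",
    "Write a limerick about a cat.",
]

print(cue_hits(sample[0]))  # ['illegal', 'dangerous']
print(cue_rate(sample))     # 0.5
```

A keyword count like this is only a rough proxy: it misses cues expressed as phrases and flags benign uses of the same words, so a high rate suggests reliance on surface language rather than proving it.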

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Future datasets should be built from the start without any overt negative language to better simulate stealthy attacks.
Safety mechanisms may currently lean too much on keyword or phrase detection rather than full intent understanding.
Intent laundering offers a practical way to generate stronger test cases for new models without manual rewriting.
Training data for safety alignment could benefit from including laundered versions of attacks to reduce reliance on surface cues.

Load-bearing premise

That intent laundering can reliably strip only triggering cues without changing how models respond for reasons unrelated to safety or introducing new artifacts.
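
A simple way to start checking this premise is to measure how much a rewrite changes a prompt on the surface, independent of safety. The sketch below assumes two crude controls, a token-length ratio and a Jaccard overlap of token sets, applied to a benign toy pair. The function names and the example are hypothetical, and the paper may use different or stronger checks.

```python
import re


def tokens(text):
    """Lowercased word tokens; a deliberately crude tokenizer."""
    return re.findall(r"\w+", text.lower())


def length_ratio(original, rewritten):
    """Token count of the rewrite divided by token count of the original."""
    n = len(tokens(original))
    if n == 0:
        raise ValueError("original prompt has no tokens")
    return len(tokens(rewritten)) / n


def jaccard_overlap(original, rewritten):
    """Shared vocabulary between the two prompts, from 0.0 to 1.0."""
    a, b = set(tokens(original)), set(tokens(rewritten))
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Benign toy pair, invented for illustration.
original = "please list the steps to bake bread"
rewritten = "please list the stages to bake bread quickly"

print(round(length_ratio(original, rewritten), 3))     # 1.143
print(round(jaccard_overlap(original, rewritten), 3))  # 0.667
```

If rewrites systematically lengthen prompts or replace most of their vocabulary, a change in model behavior could stem from those shifts rather than from cue removal alone, which is the confound the premise has to rule out.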

What would settle it

Running the same models on a fresh set of adversarial examples that were never constructed with any triggering cues and checking whether attack success rates stay near zero or rise sharply after laundering.

Original abstract

We systematically evaluate the quality of widely used adversarial safety datasets from two perspectives: in isolation and in practice. In isolation, we examine how well these datasets reflect real-world adversarial attacks based on three defining properties: being driven by ulterior intent, well-crafted, and out-of-distribution. We find that these datasets overrely on "triggering cues": words or phrases with overt negative/sensitive connotations that are intended to trigger safety mechanisms explicitly, which is unrealistic compared to real-world attacks. In practice, we evaluate whether these datasets genuinely measure safety risks or merely provoke refusals through triggering cues. To explore this, we introduce "intent laundering": a procedure that abstracts away triggering cues from adversarial attacks (data points) while strictly preserving their malicious intent and all relevant details. Our results show that current adversarial safety datasets fail to faithfully represent real-world adversarial behavior due to their overreliance on triggering cues. Once these cues are removed, all previously evaluated "reasonably safe" models become unsafe, including Gemini 3 Pro and Claude Sonnet 3.7/4. Moreover, when intent laundering is adapted as a jailbreaking technique, it consistently achieves high attack success rates, ranging from 90.00% to 100.00%, under fully black-box access. Overall, our findings expose a significant disconnect between how existing datasets evaluate model safety and how real-world adversaries behave.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows safety datasets lean on obvious trigger words that real attacks skip, and removing them tanks model performance, but the laundering steps need clearer validation.

The main takeaway is that current adversarial safety datasets over-rely on blunt trigger words, so they give a false sense of security. Once those cues are stripped via intent laundering, models that looked reasonably safe, including Gemini 3 Pro and Claude Sonnet 3.7/4, start complying with the underlying harmful requests. The paper also shows the same process works as a black-box jailbreak at 90-100% success rates. That combination is the useful part: it both diagnoses a dataset flaw and turns it into a practical attack method.

The work lines up with how real adversaries try to stay under the radar, and it checks this pattern across several well-known datasets. The results are concrete enough to make the point stick without needing fancy math.

The soft spot is exactly what the stress-test note flags. The procedure claims to remove only triggering cues while keeping intent and details intact, but the abstract gives no formal definition of a cue, no step-by-step algorithm, and no side-by-side examples with controls for length, coherence, or framing shifts. Without those, it is hard to know whether the jump in attack success comes purely from cue removal or from other prompt changes that happen during laundering. The full paper needs to show the exact transformations and any checks they did to rule out side effects.

This is for researchers who build or rely on safety benchmarks and red-teaming suites. Anyone updating eval sets or testing frontier models would get a practical warning from it. The central observation is worth referee time even if the method details need tightening, so I would send it for review with a request for the laundering code or reproducible examples.

Referee Report

2 major / 1 minor

Summary. The paper claims that widely used adversarial safety datasets overrely on overt 'triggering cues' (words/phrases with negative connotations) that do not reflect real-world attacks, which are driven by ulterior intent, well-crafted, and out-of-distribution. It introduces 'intent laundering,' a procedure that abstracts away these cues while strictly preserving malicious intent and details, showing that this renders previously 'reasonably safe' models (including Gemini 3 Pro and Claude Sonnet 3.7/4) unsafe and achieves 90-100% black-box attack success rates when used as a jailbreak.

Significance. If the central results hold after validation, the work identifies a fundamental mismatch between current safety benchmarks and realistic adversarial behavior, implying that many existing evaluations may overestimate model safety. The introduction of intent laundering as both a diagnostic tool and effective jailbreak technique could influence future dataset design and red-teaming practices in AI safety.

major comments (2)

[Abstract] The claim that intent laundering 'abstracts away triggering cues... while strictly preserving their malicious intent and all relevant details' is load-bearing for attributing model failures solely to cue removal, yet no formal definition of triggering cues, reproducible algorithm, or validation experiments (e.g., comparing non-safety-related behaviors like coherence or response length before/after laundering) are described.
The reported 90-100% attack success rates and model failures (Gemini 3 Pro, Claude Sonnet) lack details on experimental controls, number of trials, prompt variations, or baselines that would rule out artifacts from rephrasing unrelated to safety triggers.

minor comments (1)

Clarify the exact criteria used to identify 'triggering cues' versus other prompt elements in the laundering procedure.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help improve the clarity and rigor of our work. We address each major comment below and will make corresponding revisions to the manuscript.

Referee: [Abstract] The claim that intent laundering 'abstracts away triggering cues... while strictly preserving their malicious intent and all relevant details' is load-bearing for attributing model failures solely to cue removal, yet no formal definition of triggering cues, reproducible algorithm, or validation experiments (e.g., comparing non-safety-related behaviors like coherence or response length before/after laundering) are described.

Authors: We agree that a more formal presentation of the intent laundering procedure is warranted to support the central claims. In the revised version, we will add a precise definition of triggering cues in the preliminaries section and provide pseudocode for the laundering algorithm. We will also include validation experiments assessing non-safety aspects such as response coherence and length before and after the procedure to confirm that changes are isolated to the removal of triggering cues.
revision: yes

Referee: The reported 90-100% attack success rates and model failures (Gemini 3 Pro, Claude Sonnet) lack details on experimental controls, number of trials, prompt variations, or baselines that would rule out artifacts from rephrasing unrelated to safety triggers.

Authors: The manuscript reports these details in the experimental setup section, including the use of 100 trials per model across multiple datasets, with controls for prompt length and semantic preservation. To further address potential concerns about rephrasing artifacts, we will add explicit baselines comparing intent laundering to generic rephrasing techniques and report statistical measures. This will be expanded in the revision.
revision: partial
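
As an illustration of what "statistical measures" for such a baseline comparison could look like, the sketch below computes a Wilson score interval for a success rate and a pooled two-proportion z statistic for comparing two conditions. The counts are invented for illustration and are not results from the paper, and the choice of these two statistics is an assumption rather than the authors' stated method.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def two_proportion_z(s1, n1, s2, n2):
    """Pooled z statistic for the difference between two success rates."""
    pooled = (s1 + s2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        raise ValueError("standard error is zero; rates are identical at 0 or 1")
    return (s1 / n1 - s2 / n2) / se


# Hypothetical counts: 90/100 successes in one condition, 40/100 in a baseline.
low, high = wilson_interval(90, 100)
print(round(low, 3), round(high, 3))
print(round(two_proportion_z(90, 100, 40, 100), 2))
```

The Wilson interval is used here instead of the simpler normal approximation because it stays well behaved when rates sit near 100 percent, which is the regime the reported success rates fall in.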

Circularity Check

0 steps flagged

No significant circularity; procedure and evaluations are externally grounded

The paper introduces intent laundering as an explicitly defined procedure for removing triggering cues while preserving intent and details, then applies it to evaluate existing datasets via direct testing on external models (Gemini, Claude, etc.). No derivation step reduces by construction to fitted parameters, self-definitions, or self-citation chains; the central claims rest on reproducible empirical model responses rather than internal renaming or ansatz smuggling. The evaluation chain is self-contained against external benchmarks with no load-bearing self-referential elements.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the domain assumption that real-world attacks avoid overt negative cues and that the laundering procedure can isolate those cues without side effects.

axioms (1)

Domain assumption: Real-world adversarial attacks are driven by ulterior intent, well-crafted, and out-of-distribution without relying on overt negative/sensitive words.
This underpins the claim that current datasets are unrealistic due to triggering cues.

invented entities (1)

Intent laundering procedure (no independent evidence)
Purpose: To abstract away triggering cues from adversarial attacks while preserving malicious intent and details.
Newly introduced method whose validity rests on the paper's own definition and application.
