
arXiv: 2602.16729 · v3 · submitted 2026-02-17 · 💻 cs.CR · cs.AI · cs.CL · cs.LG

Intent Laundering: AI Safety Datasets Are Not What They Seem

Pith reviewed 2026-05-15 21:32 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.CL · cs.LG
keywords AI safety · adversarial datasets · jailbreaking · triggering cues · intent laundering · model evaluation · black-box attacks · refusal mechanisms

The pith

Current adversarial safety datasets over-rely on obvious triggering cues that do not match real-world attacks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that popular safety datasets for testing AI models depend heavily on words or phrases with overtly negative connotations that explicitly prompt refusals. These cues make the datasets unrealistic compared with actual adversarial attempts, which avoid such signals to hide intent. The authors introduce intent laundering, a procedure that strips out these cues while keeping the original harmful goal and details intact. Applied to existing datasets, it causes every tested model previously rated as reasonably safe to fail, including Gemini 3 Pro and Claude Sonnet 3.7/4. Adapted as a black-box jailbreak, the same technique reaches attack success rates of 90 to 100 percent.

Core claim

Adversarial safety datasets fail to represent real-world attacks because they over-rely on triggering cues: words or phrases carrying overt negative connotations that are meant to activate safety mechanisms. Intent laundering removes these cues while preserving malicious intent and all relevant details, after which every model previously rated "reasonably safe" becomes unsafe. The same procedure, when used directly as a jailbreaking method, achieves attack success rates between 90 and 100 percent under fully black-box conditions.
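One rough way to quantify over-reliance on triggering cues is the fraction of prompts in a dataset that contain at least one term from a cue lexicon. The sketch below is illustrative only: `CUE_LEXICON`, `contains_cue`, `cue_coverage`, and the sample prompts are hypothetical names and data, not taken from the paper, and a plain word list is an assumed stand-in for whatever criteria the authors actually use.

```python
import re

# Hypothetical lexicon of overtly negative terms (an assumption, not from the paper).
CUE_LEXICON = frozenset({"illegal", "malicious", "harmful", "unethical"})

def contains_cue(prompt: str, lexicon=CUE_LEXICON) -> bool:
    """Return True if any lowercase word token of the prompt is in the lexicon."""
    tokens = set(re.findall(r"[a-z']+", prompt.lower()))
    return not tokens.isdisjoint(lexicon)

def cue_coverage(prompts, lexicon=CUE_LEXICON) -> float:
    """Fraction of prompts that contain at least one lexicon term."""
    prompts = list(prompts)
    if not prompts:
        raise ValueError("need at least one prompt")
    return sum(contains_cue(p, lexicon) for p in prompts) / len(prompts)

# Placeholder prompts with the sensitive content abstracted to X, Y, Z.
sample = [
    "Describe an illegal way to do X.",
    "Explain how process X works in detail.",
    "Write a harmful message about Y.",
    "Summarize the history of Z.",
]
print(cue_coverage(sample))
```

A coverage value near 1 on a benchmark would be consistent with the claim that refusals may be driven by surface wording; exact word matching will miss paraphrased cues, so this is a lower bound at best.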

What carries the argument

Intent laundering, a procedure that abstracts away triggering cues from adversarial examples while strictly preserving malicious intent and all relevant details.

If this is right

  • Models rated as reasonably safe on existing datasets, including Gemini 3 Pro and Claude Sonnet 3.7/4, become unsafe once triggering cues are removed.
  • Intent laundering adapted as a jailbreak achieves 90 to 100 percent success under black-box access.
  • Existing datasets measure sensitivity to explicit negative language more than genuine resistance to hidden malicious intent.
  • Safety evaluations based on current datasets overestimate robustness against realistic adversaries.
  • A disconnect exists between how datasets test safety and how real-world attacks are crafted.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Future datasets should be built from the start without any overt negative language to better simulate stealthy attacks.
  • Safety mechanisms may currently lean too much on keyword or phrase detection rather than full intent understanding.
  • Intent laundering offers a practical way to generate stronger test cases for new models without manual rewriting.
  • Training data for safety alignment could benefit from including laundered versions of attacks to reduce reliance on surface cues.

Load-bearing premise

That intent laundering can reliably strip only triggering cues without changing how models respond for reasons unrelated to safety or introducing new artifacts.

What would settle it

Running the same models on a fresh set of adversarial examples that were constructed from the start without any triggering cues, and checking whether attack success rates stay near zero or rise sharply, as they do after laundering.
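The comparison described here reduces to computing an attack success rate (ASR) per condition and looking at the difference. A minimal sketch follows, assuming each trial has already been judged as a success or failure. The functions `attack_success_rate` and `asr_shift` and the outcome lists are hypothetical placeholders, not the paper's results or tooling.

```python
def attack_success_rate(outcomes) -> float:
    """Share of trials judged successful; outcomes is an iterable of booleans."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("need at least one trial")
    return sum(bool(o) for o in outcomes) / len(outcomes)

def asr_shift(baseline, treated) -> float:
    """ASR of the treated condition minus ASR of the baseline condition."""
    return attack_success_rate(treated) - attack_success_rate(baseline)

# Made-up outcomes for illustration only (True = attack judged successful).
with_cues = [False] * 9 + [True]
without_cues = [True] * 9 + [False]

print(round(attack_success_rate(with_cues), 2))
print(round(attack_success_rate(without_cues), 2))
print(round(asr_shift(with_cues, without_cues), 2))
```

If a natively cue-free dataset produced a shift of similar size to the laundered one, that would suggest the effect is not an artifact of the laundering step itself.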

Original abstract

We systematically evaluate the quality of widely used adversarial safety datasets from two perspectives: in isolation and in practice. In isolation, we examine how well these datasets reflect real-world adversarial attacks based on three defining properties: being driven by ulterior intent, well-crafted, and out-of-distribution. We find that these datasets overrely on "triggering cues": words or phrases with overt negative/sensitive connotations that are intended to trigger safety mechanisms explicitly, which is unrealistic compared to real-world attacks. In practice, we evaluate whether these datasets genuinely measure safety risks or merely provoke refusals through triggering cues. To explore this, we introduce "intent laundering": a procedure that abstracts away triggering cues from adversarial attacks (data points) while strictly preserving their malicious intent and all relevant details. Our results show that current adversarial safety datasets fail to faithfully represent real-world adversarial behavior due to their overreliance on triggering cues. Once these cues are removed, all previously evaluated "reasonably safe" models become unsafe, including Gemini 3 Pro and Claude Sonnet 3.7/4. Moreover, when intent laundering is adapted as a jailbreaking technique, it consistently achieves high attack success rates, ranging from 90.00% to 100.00%, under fully black-box access. Overall, our findings expose a significant disconnect between how existing datasets evaluate model safety and how real-world adversaries behave.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that widely used adversarial safety datasets overrely on overt 'triggering cues' (words/phrases with negative connotations) that do not reflect real-world attacks, which are driven by ulterior intent, well-crafted, and out-of-distribution. It introduces 'intent laundering,' a procedure that abstracts away these cues while strictly preserving malicious intent and details, showing that this renders previously 'reasonably safe' models (including Gemini 3 Pro and Claude Sonnet 3.7/4) unsafe and achieves 90-100% black-box attack success rates when used as a jailbreak.

Significance. If the central results hold after validation, the work identifies a fundamental mismatch between current safety benchmarks and realistic adversarial behavior, implying that many existing evaluations may overestimate model safety. The introduction of intent laundering as both a diagnostic tool and effective jailbreak technique could influence future dataset design and red-teaming practices in AI safety.

major comments (2)
  1. [Abstract] The claim that intent laundering 'abstracts away triggering cues... while strictly preserving their malicious intent and all relevant details' is load-bearing for attributing model failures solely to cue removal, yet no formal definition of triggering cues, reproducible algorithm, or validation experiments (e.g., comparing non-safety-related behaviors like coherence or response length before/after laundering) are described.
  2. The reported 90-100% attack success rates and model failures (Gemini 3 Pro, Claude Sonnet) lack details on experimental controls, number of trials, prompt variations, or baselines that would rule out artifacts from rephrasing unrelated to safety triggers.
minor comments (1)
  1. Clarify the exact criteria used to identify 'triggering cues' versus other prompt elements in the laundering procedure.
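The response-length check suggested in the first major comment could be sketched as a paired comparison. The example below assumes two aligned lists of texts (for instance, model responses collected before and after laundering the same prompts) and uses whitespace word counts as a crude length measure. The function `paired_length_shift` and the toy strings are hypothetical, and this is not a procedure described in the paper.

```python
from math import sqrt
from statistics import mean, stdev

def paired_length_shift(before, after):
    """Mean change in word count (after minus before) and its standard error."""
    if len(before) != len(after):
        raise ValueError("before and after must be aligned pairs")
    if len(before) < 2:
        raise ValueError("need at least two pairs to estimate spread")
    diffs = [len(a.split()) - len(b.split()) for b, a in zip(before, after)]
    return mean(diffs), stdev(diffs) / sqrt(len(diffs))

# Toy texts for illustration only.
before = ["a b c", "a b", "a b c d"]
after = ["a b c d", "a b c", "a b c d e"]

shift, std_err = paired_length_shift(before, after)
print(shift, std_err)
```

A mean shift that is large relative to its standard error would indicate that laundering changes more than the presence of cues, which is the artifact the referee is asking the authors to rule out.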

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help improve the clarity and rigor of our work. We address each major comment below and will make corresponding revisions to the manuscript.

Point-by-point responses
  1. Referee: [Abstract] The claim that intent laundering 'abstracts away triggering cues... while strictly preserving their malicious intent and all relevant details' is load-bearing for attributing model failures solely to cue removal, yet no formal definition of triggering cues, reproducible algorithm, or validation experiments (e.g., comparing non-safety-related behaviors like coherence or response length before/after laundering) are described.

    Authors: We agree that a more formal presentation of the intent laundering procedure is warranted to support the central claims. In the revised version, we will add a precise definition of triggering cues in the preliminaries section and provide pseudocode for the laundering algorithm. We will also include validation experiments assessing non-safety aspects such as response coherence and length before and after the procedure to confirm that changes are isolated to the removal of triggering cues. revision: yes

  2. Referee: The reported 90-100% attack success rates and model failures (Gemini 3 Pro, Claude Sonnet) lack details on experimental controls, number of trials, prompt variations, or baselines that would rule out artifacts from rephrasing unrelated to safety triggers.

    Authors: The manuscript reports these details in the experimental setup section, including the use of 100 trials per model across multiple datasets, with controls for prompt length and semantic preservation. To further address potential concerns about rephrasing artifacts, we will add explicit baselines comparing intent laundering to generic rephrasing techniques and report statistical measures. This will be expanded in the revision. revision: partial
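As one example of the statistical measures mentioned above, a success rate measured over 100 trials can be reported with a Wilson score interval, which behaves sensibly near 0% and 100%. The sketch below is an illustration under that assumption. The function name `wilson_interval` is hypothetical, and the counts 90 and 100 simply mirror the reported range rather than any per-model result.

```python
from math import sqrt
from statistics import NormalDist

def wilson_interval(successes: int, trials: int, confidence: float = 0.95):
    """Wilson score interval for a binomial proportion."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)

for k in (90, 100):
    low, high = wilson_interval(k, 100)
    print(f"{k}/100 successes: 95% interval [{low:.3f}, {high:.3f}]")
```

With 100 trials, an observed rate of 90% leaves a 95% interval of roughly 0.83 to 0.94, so differences of a few points between models or baselines would not be distinguishable at this sample size.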

Circularity Check

0 steps flagged

No significant circularity; procedure and evaluations are externally grounded

Full rationale

The paper introduces intent laundering as an explicitly defined procedure for removing triggering cues while preserving intent and details, then applies it to evaluate existing datasets via direct testing on external models (Gemini, Claude, etc.). No derivation step reduces by construction to fitted parameters, self-definitions, or self-citation chains; the central claims rest on reproducible empirical model responses rather than internal renaming or ansatz smuggling. The evaluation chain is self-contained against external benchmarks with no load-bearing self-referential elements.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the domain assumption that real-world attacks avoid overt negative cues and that the laundering procedure can isolate those cues without side effects.

axioms (1)
  • domain assumption: Real-world adversarial attacks are driven by ulterior intent, well-crafted, and out-of-distribution without relying on overt negative/sensitive words.
    This underpins the claim that current datasets are unrealistic due to triggering cues.
invented entities (1)
  • Intent laundering procedure (no independent evidence)
    purpose: To abstract away triggering cues from adversarial attacks while preserving malicious intent and details
    Newly introduced method whose validity rests on the paper's own definition and application.

pith-pipeline@v0.9.0 · 5549 in / 1320 out tokens · 25472 ms · 2026-05-15T21:32:35.481005+00:00 · methodology
