pith. machine review for the scientific record.

arxiv: 2510.00635 · v5 · submitted 2025-10-01 · 💻 cs.CV

Erased, But Not Forgotten: Erased Rectified Flow Transformers Still Remain Unsafe Under Concept Attack

Pith reviewed 2026-05-18 11:12 UTC · model grok-4.3

classification 💻 cs.CV
keywords concept erasure · rectified flow · Flux · text-to-image · adversarial attack · attention localization · model robustness · safety evaluation

The pith

Concept-erased Flux models can still generate forbidden content when attacked via reverse attention.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that concept erasure methods applied to rectified flow transformers such as Flux remain vulnerable because they depend on attention localization to suppress unwanted signals. The authors introduce ReFlux, an attack that uses reverse-attention optimization to reactivate those signals, supported by velocity guidance that steers the flow-matching process and a consistency term that preserves overall image structure. Experiments indicate the attack succeeds efficiently across multiple erasure techniques, producing a benchmark for testing robustness in this newer model family. A sympathetic reader would care because safety filters in current high-quality image generators can be bypassed without obvious loss of generation quality.

Core claim

Existing concept erasure techniques in rectified flow transformers rely on attention localization to suppress target concepts. ReFlux disrupts this localization through reverse-attention optimization, reinforced by a velocity-guided dynamic that steers the flow matching process toward reactivation and a consistency-preserving objective that maintains global layout and unrelated content. The resulting attack reactivates erased concepts while keeping generation stable and efficient.
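The idea of steering a flow-matching sampler can be illustrated with a toy example. The sketch below is not the paper's method: the abstract gives no formula for the velocity-guided dynamic, so a classifier-free-guidance-style blend of two velocity fields is used here as a stand-in. The names `euler_integrate`, `straight_path_velocity`, and `guided_velocity`, the two-dimensional "modes", and the convention that time runs from noise at t=0 to data at t=1 are all assumptions made for illustration.

```python
def euler_integrate(x_start, velocity_fn, steps=50):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x_start)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def straight_path_velocity(endpoint):
    """Velocity field that moves x along a straight line reaching `endpoint` at t=1."""
    def velocity(x, t):
        # Remaining displacement divided by remaining time.
        return [(e - xi) / (1.0 - t) for e, xi in zip(endpoint, x)]
    return velocity


def guided_velocity(v_erased, v_concept, weight):
    """Hypothetical guidance: blend the erased model's velocity toward a concept direction."""
    def velocity(x, t):
        ve = v_erased(x, t)
        vc = v_concept(x, t)
        return [a + weight * (b - a) for a, b in zip(ve, vc)]
    return velocity


# Toy stand-ins: where the erased model ends up vs. where the concept lives.
noise = [0.3, -1.2]
erased_mode = [0.0, 0.0]
concept_mode = [4.0, 2.0]

v_e = straight_path_velocity(erased_mode)
v_c = straight_path_velocity(concept_mode)

unguided = euler_integrate(noise, guided_velocity(v_e, v_c, 0.0))
guided = euler_integrate(noise, guided_velocity(v_e, v_c, 1.0))
```

With `weight=0.0` the sample follows the erased model's path unchanged; raising the weight moves the endpoint toward the concept mode. A real attack would have to obtain the concept-directed velocity from the model itself, which this toy does not attempt.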

What carries the argument

Reverse-attention optimization that targets and disrupts the attention localization relied on by concept erasure in rectified flow models.
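A toy version of this mechanism may help fix ideas. The sketch below assumes that "erasure" shows up as a very small softmax attention weight on the concept token, and that the attacker can run gradient ascent on the log of that weight. The paper's actual objective is not given in the abstract; the function names, the choice to perturb the concept token's key vector, and all numeric values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def attention_weights(query, keys):
    """Dot-product attention weights of one query over a list of key vectors."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    return softmax(scores)


def reverse_attention_ascent(query, keys, concept_idx, lr=0.5, steps=100):
    """Gradient ascent on log p_c, where p_c is the attention weight on the concept token.

    Only the concept token's key is updated. For dot-product scores,
    d(log p_c)/d(k_c) = (1 - p_c) * query.
    """
    keys = [list(k) for k in keys]  # copy so the caller's keys are untouched
    for _ in range(steps):
        p_c = attention_weights(query, keys)[concept_idx]
        keys[concept_idx] = [
            k + lr * (1.0 - p_c) * q for k, q in zip(keys[concept_idx], query)
        ]
    return keys


# Toy setup: token 2 plays the role of the erased concept, with a suppressed score.
query = [1.0, 0.0]
keys = [[2.0, 0.0], [1.5, 0.0], [-3.0, 0.0]]
concept_idx = 2

before = attention_weights(query, keys)[concept_idx]
attacked_keys = reverse_attention_ascent(query, keys, concept_idx)
after = attention_weights(query, attacked_keys)[concept_idx]
```

Here `before` is the suppressed attention weight and `after` is the weight once the ascent has run. The toy only demonstrates that a localized suppression can be undone by optimizing against it; it says nothing about how stable generation remains, which is what the velocity guidance and consistency term are described as handling.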

If this is right

  • Current erasure methods for rectified flow transformers must be re-evaluated for robustness against attention-targeted attacks.
  • The proposed attack provides a concrete benchmark for measuring how well any new erasure technique resists reactivation.
  • Velocity guidance and consistency preservation can be combined to keep attacks stable without degrading image quality.
  • Safety claims for Flux-style models require testing against reverse-attention disruptions rather than only SD-style attacks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Other flow-matching generative models may share the same attention-localization dependence and thus the same vulnerability.
  • Defenses could be designed to make attention patterns less localized or to add noise that resists reverse optimization.
  • Long-term safety may require erasure methods that alter the underlying velocity field rather than just masking attention.
  • Model developers should include reverse-attention attacks in their standard robustness test suites before release.

Load-bearing premise

Concept erasure in Flux works primarily by localizing and suppressing attention to the unwanted concept.

What would settle it

Running the attack on an erased Flux model and measuring whether the target concept appears in generated images at rates comparable to an unerased model, using the same quantitative metrics reported in the paper.
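One way to make that comparison concrete is sketched below. It assumes a per-image boolean label from some concept detector (not specified here) and reports the attack success rate alongside its ratio to the unerased model's rate. The function names and the `recovery_ratio` measure are illustrative assumptions, not metrics taken from the paper, and the sample labels are placeholders rather than reported results.

```python
def attack_success_rate(detections):
    """Fraction of generated images in which the detector found the target concept."""
    if not detections:
        raise ValueError("need at least one detection label")
    return sum(1 for d in detections if d) / len(detections)


def recovery_ratio(asr_attacked_erased, asr_unerased):
    """How much of the unerased model's concept rate the attack recovers (hypothetical measure)."""
    if asr_unerased == 0:
        raise ValueError("unerased model never produced the concept; ratio undefined")
    return asr_attacked_erased / asr_unerased


# Placeholder detector outputs, one boolean per generated image (not real data).
attacked_erased_labels = [True, False, True, True]
unerased_labels = [True, True, True, True]

asr_attacked = attack_success_rate(attacked_erased_labels)
asr_reference = attack_success_rate(unerased_labels)
ratio = recovery_ratio(asr_attacked, asr_reference)
```

A ratio near 1 would indicate that the attacked erased model produces the concept about as often as the unerased model; a ratio near 0 would indicate that the erasure held.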

Original abstract

Recent advances in text-to-image (T2I) diffusion models have enabled impressive generative capabilities, but they also raise significant safety concerns due to the potential to produce harmful or undesirable content. While concept erasure has been explored as a mitigation strategy, most existing approaches and corresponding attack evaluations are tailored to Stable Diffusion (SD) and exhibit limited effectiveness when transferred to next-generation rectified flow transformers such as Flux. In this work, we present ReFlux, the first concept attack method specifically designed to assess the robustness of concept erasure in the latest rectified flow-based T2I framework. Our approach is motivated by the observation that existing concept erasure techniques, when applied to Flux, fundamentally rely on a phenomenon known as attention localization. Building on this insight, we propose a simple yet effective attack strategy that specifically targets this property. At its core, a reverse-attention optimization strategy is introduced to effectively reactivate suppressed signals while stabilizing attention. This is further reinforced by a velocity-guided dynamic that enhances the robustness of concept reactivation by steering the flow matching process, and a consistency-preserving objective that maintains the global layout and preserves unrelated content. Extensive experiments consistently demonstrate the effectiveness and efficiency of the proposed attack method, establishing a reliable benchmark for evaluating the robustness of concept erasure strategies in rectified flow transformers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces ReFlux, the first concept attack specifically targeting concept-erased rectified flow transformers such as Flux. It observes that existing erasure methods on Flux rely on attention localization, then proposes a reverse-attention optimization to reactivate suppressed signals, augmented by a velocity-guided dynamic to steer flow matching and a consistency-preserving objective to maintain global layout and unrelated content. The authors claim that extensive experiments demonstrate the attack's effectiveness and efficiency, thereby establishing a benchmark for evaluating the robustness of concept erasure in rectified flow models.

Significance. If the experimental claims hold with rigorous quantitative support, the work would be significant for AI safety in text-to-image generation. It shifts the focus from Stable Diffusion to next-generation rectified flow architectures, showing that current erasure techniques may leave models vulnerable to targeted reactivation attacks. The provision of a dedicated attack benchmark could usefully inform the design of more robust erasure strategies or alternative safety mechanisms.

major comments (2)
  1. [Abstract] Abstract: the central claim that 'extensive experiments consistently demonstrate the effectiveness and efficiency' and 'establishing a reliable benchmark' is load-bearing, yet the abstract supplies no quantitative metrics, baseline comparisons, success rates, or details on data exclusion, statistical controls, or evaluation protocols. This absence leaves the effectiveness assertion without verifiable support.
  2. [Abstract] Abstract (motivation paragraph): the premise that erasure techniques 'fundamentally rely on a phenomenon known as attention localization' and that the attack 'specifically targets this property' is load-bearing for the entire ReFlux construction (reverse-attention + velocity guidance + consistency). No derivation, ablation, or empirical isolation of this dependence is referenced; if erasure instead primarily alters the velocity field without strong localization, reported success could reflect generic destabilization rather than the claimed mechanism.
minor comments (1)
  1. [Abstract] The abstract is a single dense paragraph; breaking it into shorter paragraphs or adding a sentence on experimental scale would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our claims. We address each point below and revise the manuscript where appropriate to strengthen verifiability and motivation.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that 'extensive experiments consistently demonstrate the effectiveness and efficiency' and 'establishing a reliable benchmark' is load-bearing, yet the abstract supplies no quantitative metrics, baseline comparisons, success rates, or details on data exclusion, statistical controls, or evaluation protocols. This absence leaves the effectiveness assertion without verifiable support.

    Authors: We agree that the abstract would benefit from concrete quantitative anchors. The full manuscript reports these details in Sections 4 and 5, including success rates, baseline comparisons against prior attacks, and the evaluation protocol on held-out prompts with standard metrics. In the revised version we will incorporate representative quantitative highlights and a concise reference to the evaluation setup directly into the abstract.

    Revision: yes

  2. Referee: [Abstract] Abstract (motivation paragraph): the premise that erasure techniques 'fundamentally rely on a phenomenon known as attention localization' and that the attack 'specifically targets this property' is load-bearing for the entire ReFlux construction (reverse-attention + velocity guidance + consistency). No derivation, ablation, or empirical isolation of this dependence is referenced; if erasure instead primarily alters the velocity field without strong localization, reported success could reflect generic destabilization rather than the claimed mechanism.

    Authors: The premise is grounded in our empirical analysis of existing erasure methods on Flux, where attention-map visualizations and targeted ablations (detailed in the introduction and experimental sections) demonstrate that suppression occurs primarily via attention localization rather than broad velocity-field changes. Ablations further show that removing the reverse-attention component substantially reduces reactivation success, while generic perturbations without localization targeting perform markedly worse. We will add explicit cross-references to these analyses in the abstract and expand the motivation paragraph to include a brief statement of the empirical isolation.

    Revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

Full rationale

The paper motivates its ReFlux attack from an empirical observation that concept erasure on Flux relies on attention localization, then introduces independent optimization terms (reverse-attention, velocity guidance, consistency preservation) whose effectiveness is assessed via experiments. No equations, fitted parameters, or self-citations are shown that reduce any claimed prediction or result to the inputs by construction. The central premise is presented as an observation rather than a derived quantity, and the attack components do not loop back to redefine or presuppose the localization property they target.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the domain assumption that erasure success in Flux is driven by attention localization and that this property can be isolated and reversed without collapsing the generation process.

axioms (1)
  • domain assumption: Existing concept erasure techniques applied to Flux fundamentally rely on attention localization.
    This premise is stated in the abstract as the motivation for the attack design.

pith-pipeline@v0.9.0 · 5784 in / 1190 out tokens · 45200 ms · 2026-05-18T11:12:15.615879+00:00 · methodology

