Erased, But Not Forgotten: Erased Rectified Flow Transformers Still Remain Unsafe Under Concept Attack
Pith reviewed 2026-05-18 11:12 UTC · model grok-4.3
The pith
Concept-erased Flux models can still generate forbidden content when attacked via reverse attention.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Existing concept erasure techniques in rectified flow transformers rely on attention localization to suppress target concepts. ReFlux disrupts this localization through reverse-attention optimization, reinforced by a velocity-guided dynamic that steers the flow matching process toward reactivation and a consistency-preserving objective that maintains global layout and unrelated content. The resulting attack reactivates erased concepts while keeping generation stable and efficient.
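To make "steering the flow matching process" concrete, here is a minimal toy sketch of Euler sampling along a velocity field with an optional additive steering term. It is not the paper's formulation, which this page does not reproduce. The names `euler_sample`, `toy_velocity`, `toy_guidance`, and `scale` are hypothetical, the field is a constant 2D vector rather than a transformer output, and the convention of integrating from t=0 to t=1 is an assumption.

```python
from typing import Callable, List, Optional

Vec = List[float]
Field = Callable[[Vec, float], Vec]


def euler_sample(x0: Vec, velocity: Field, steps: int = 50,
                 guidance: Optional[Field] = None, scale: float = 0.0) -> Vec:
    """Integrate dx/dt = v(x, t) + scale * g(x, t) from t=0 to t=1 (Euler)."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity(x, t)
        if guidance is not None:
            # Hypothetical steering term added to the model velocity.
            g = guidance(x, t)
            v = [vi + scale * gi for vi, gi in zip(v, g)]
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(x: Vec, t: float) -> Vec:
    # Constant velocity: the straight-line transport that rectified flow aims for.
    return [2.0, -1.0]


def toy_guidance(x: Vec, t: float) -> Vec:
    # Illustrative constant push along the first coordinate.
    return [1.0, 0.0]


plain = euler_sample([0.0, 0.0], toy_velocity)
steered = euler_sample([0.0, 0.0], toy_velocity, guidance=toy_guidance, scale=0.5)
```

In this toy setting the steered sample ends up displaced along the first coordinate relative to the plain one. The point is only that a small term added to the velocity at every step accumulates over the trajectory.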
What carries the argument
Reverse-attention optimization that targets and disrupts the attention localization relied on by concept erasure in rectified flow models.
If this is right
- Current erasure methods for rectified flow transformers must be re-evaluated for robustness against attention-targeted attacks.
- The proposed attack provides a concrete benchmark for measuring how well any new erasure technique resists reactivation.
- Velocity guidance and consistency preservation can be combined to keep attacks stable without degrading image quality.
- Safety claims for Flux-style models require testing against reverse-attention disruptions rather than only SD-style attacks.
Where Pith is reading between the lines
- Other flow-matching generative models may share the same attention-localization dependence and thus the same vulnerability.
- Defenses could be designed to make attention patterns less localized or to add noise that resists reverse optimization.
- Long-term safety may require erasure methods that alter the underlying velocity field rather than just masking attention.
- Model developers should include reverse-attention attacks in their standard robustness test suites before release.
Load-bearing premise
Concept erasure in Flux works primarily by localizing and suppressing attention to the unwanted concept.
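One way to probe this premise is to quantify how concentrated a concept token's attention is over image patches. The sketch below uses one minus the normalized Shannon entropy as a proxy. This is an assumed measure, not the paper's definition of attention localization, and `localization_score` is a hypothetical helper operating on a flat list of non-negative attention weights.

```python
import math
from typing import Sequence


def localization_score(attn: Sequence[float]) -> float:
    """Return 1 - normalized entropy of attention weights over patches.

    0.0 means attention is spread uniformly; 1.0 means all mass is on
    a single patch. Weights need not be pre-normalized.
    """
    if len(attn) < 2:
        raise ValueError("need at least two patches")
    if any(a < 0 for a in attn):
        raise ValueError("attention weights must be non-negative")
    total = sum(attn)
    if total <= 0:
        raise ValueError("attention weights must not all be zero")
    probs = [a / total for a in attn]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return 1.0 - entropy / math.log(len(probs))


uniform = localization_score([1.0, 1.0, 1.0, 1.0])
peaked = localization_score([0.0, 0.0, 1.0, 0.0])
```

Comparing such a score for the target token before and after erasure would be one simple check of whether erasure changes localization at all. It would not by itself show that localization is the main mechanism.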
What would settle it
Running the attack on an erased Flux model and measuring whether the target concept appears in generated images at rates comparable to an unerased model, using the same quantitative metrics reported in the paper.
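The comparison described above can be sketched as a small calculation over per-image detector outputs. The helper names `concept_rate` and `recovery_ratio` are hypothetical, the boolean flags stand in for whatever concept detector the paper uses, and the numbers are illustrative placeholders rather than results from the paper.

```python
from typing import Sequence


def concept_rate(detections: Sequence[bool]) -> float:
    """Fraction of generated images in which the target concept was detected."""
    if not detections:
        raise ValueError("need at least one detection result")
    return sum(bool(d) for d in detections) / len(detections)


def recovery_ratio(erased: Sequence[bool], attacked: Sequence[bool],
                   unerased: Sequence[bool]) -> float:
    """Share of the erased-to-unerased gap that the attack closes.

    0.0 means the attack does no better than the erased model alone;
    1.0 means the attacked model matches the unerased model.
    """
    base = concept_rate(erased)
    gap = concept_rate(unerased) - base
    if gap <= 0:
        raise ValueError("unerased rate must exceed the erased rate")
    return (concept_rate(attacked) - base) / gap


# Illustrative placeholder flags, not data from the paper.
erased_flags = [True] * 1 + [False] * 9
attacked_flags = [True] * 7 + [False] * 3
unerased_flags = [True] * 9 + [False] * 1
ratio = recovery_ratio(erased_flags, attacked_flags, unerased_flags)
```

A ratio near 1 on the same prompts and detector would support the claim that the erased concept is effectively recovered. A ratio near 0 would count against it.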
Original abstract
Recent advances in text-to-image (T2I) diffusion models have enabled impressive generative capabilities, but they also raise significant safety concerns due to the potential to produce harmful or undesirable content. While concept erasure has been explored as a mitigation strategy, most existing approaches and corresponding attack evaluations are tailored to Stable Diffusion (SD) and exhibit limited effectiveness when transferred to next-generation rectified flow transformers such as Flux. In this work, we present ReFlux, the first concept attack method specifically designed to assess the robustness of concept erasure in the latest rectified flow-based T2I framework. Our approach is motivated by the observation that existing concept erasure techniques, when applied to Flux, fundamentally rely on a phenomenon known as attention localization. Building on this insight, we propose a simple yet effective attack strategy that specifically targets this property. At its core, a reverse-attention optimization strategy is introduced to effectively reactivate suppressed signals while stabilizing attention. This is further reinforced by a velocity-guided dynamic that enhances the robustness of concept reactivation by steering the flow matching process, and a consistency-preserving objective that maintains the global layout and preserves unrelated content. Extensive experiments consistently demonstrate the effectiveness and efficiency of the proposed attack method, establishing a reliable benchmark for evaluating the robustness of concept erasure strategies in rectified flow transformers.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ReFlux, the first concept attack specifically targeting concept-erased rectified flow transformers such as Flux. It observes that existing erasure methods on Flux rely on attention localization, then proposes a reverse-attention optimization to reactivate suppressed signals, augmented by a velocity-guided dynamic to steer flow matching and a consistency-preserving objective to maintain global layout and unrelated content. The authors claim that extensive experiments demonstrate the attack's effectiveness and efficiency, thereby establishing a benchmark for evaluating the robustness of concept erasure in rectified flow models.
Significance. If the experimental claims hold with rigorous quantitative support, the work would be significant for AI safety in text-to-image generation. It shifts the focus from Stable Diffusion to next-generation rectified flow architectures, showing that current erasure techniques may leave models vulnerable to targeted reactivation attacks. The provision of a dedicated attack benchmark could usefully inform the design of more robust erasure strategies or alternative safety mechanisms.
major comments (2)
- [Abstract] Abstract: the central claim that 'extensive experiments consistently demonstrate the effectiveness and efficiency' and 'establishing a reliable benchmark' is load-bearing, yet the abstract supplies no quantitative metrics, baseline comparisons, success rates, or details on data exclusion, statistical controls, or evaluation protocols. This absence leaves the effectiveness assertion without verifiable support.
- [Abstract] Abstract (motivation paragraph): the premise that erasure techniques 'fundamentally rely on a phenomenon known as attention localization' and that the attack 'specifically targets this property' is load-bearing for the entire ReFlux construction (reverse-attention + velocity guidance + consistency). No derivation, ablation, or empirical isolation of this dependence is referenced; if erasure instead primarily alters the velocity field without strong localization, reported success could reflect generic destabilization rather than the claimed mechanism.
minor comments (1)
- [Abstract] The abstract is a single dense paragraph; breaking it into shorter paragraphs or adding a sentence on experimental scale would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the presentation of our claims. We address each point below and revise the manuscript where appropriate to strengthen verifiability and motivation.
Point-by-point responses
- Referee: [Abstract] Abstract: the central claim that 'extensive experiments consistently demonstrate the effectiveness and efficiency' and 'establishing a reliable benchmark' is load-bearing, yet the abstract supplies no quantitative metrics, baseline comparisons, success rates, or details on data exclusion, statistical controls, or evaluation protocols. This absence leaves the effectiveness assertion without verifiable support.
  Authors: We agree that the abstract would benefit from concrete quantitative anchors. The full manuscript reports these details in Sections 4 and 5, including success rates, baseline comparisons against prior attacks, and the evaluation protocol on held-out prompts with standard metrics. In the revised version we will incorporate representative quantitative highlights and a concise reference to the evaluation setup directly into the abstract.
  Revision: yes
- Referee: [Abstract] Abstract (motivation paragraph): the premise that erasure techniques 'fundamentally rely on a phenomenon known as attention localization' and that the attack 'specifically targets this property' is load-bearing for the entire ReFlux construction (reverse-attention + velocity guidance + consistency). No derivation, ablation, or empirical isolation of this dependence is referenced; if erasure instead primarily alters the velocity field without strong localization, reported success could reflect generic destabilization rather than the claimed mechanism.
  Authors: The premise is grounded in our empirical analysis of existing erasure methods on Flux, where attention-map visualizations and targeted ablations (detailed in the introduction and experimental sections) demonstrate that suppression occurs primarily via attention localization rather than broad velocity-field changes. Ablations further show that removing the reverse-attention component substantially reduces reactivation success while generic perturbations without localization targeting perform markedly worse. We will add explicit cross-references to these analyses in the abstract and expand the motivation paragraph to include a brief statement of the empirical isolation.
  Revision: partial
Circularity Check
No significant circularity; derivation is self-contained
Full rationale
The paper motivates its ReFlux attack from an empirical observation that concept erasure on Flux relies on attention localization, then introduces independent optimization terms (reverse-attention, velocity guidance, consistency preservation) whose effectiveness is assessed via experiments. No equations, fitted parameters, or self-citations are shown that reduce any claimed prediction or result to the inputs by construction. The central premise is presented as an observation rather than a derived quantity, and the attack components do not loop back to redefine or presuppose the localization property they target.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Existing concept erasure techniques applied to Flux fundamentally rely on attention localization.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "existing concept erasure techniques, when applied to Flux, fundamentally rely on a phenomenon known as attention localization... reverse-attention optimization strategy... velocity-guided dynamic... consistency-preserving objective"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "Rectified flow transformers... flow matching mechanisms... transformer-based backbones"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.