
arxiv: 2603.00975 · v2 · pith:N7I7Z6E3new · submitted 2026-03-01 · 💻 cs.LG · cs.AI

Forgetting is Competition: Rethinking Unlearning as Representation Interference in Diffusion Models

Pith reviewed 2026-05-21 12:23 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords concept unlearning · diffusion models · text-to-image generation · representation interference · erase-retain balance · gradient competition · attention localization

The pith

Unlearning specific concepts from diffusion models works better by creating competition from distractor representations instead of direct erasure.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that post-hoc concept removal in text-to-image diffusion models suffers from erase-retain imbalance because direct suppression damages shared capabilities while conservative updates leave targets recoverable. It proposes treating forgetting as retroactive interference: ascent on target prompts weakens unwanted behavior while descent on a diverse distractor set pushes the model toward multiple non-target outputs under the same prompt. A lightweight diagnostic then picks attention blocks by testing erase versus retain performance on generated images, exploiting the fact that suppression is easier to achieve broadly than retention. This combination is shown to outperform baselines on several unlearning benchmarks and models including Stable Diffusion variants. Readers care because deployed generative systems must satisfy copyright and safety demands without retraining from scratch or losing general usefulness.
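The ascent-plus-descent idea in the paragraph above can be written as a single objective. The following is a sketch under stated assumptions, not the paper's own notation: every symbol is introduced here for illustration. $c^\ast$ is the target prompt, $x^\ast$ a target-concept image, $x_1, \dots, x_K$ distractor images, $\lambda > 0$ a weighting coefficient, and $\theta_S$ the parameters of the selected attention blocks. The per-sample loss $\ell$ is assumed to be the standard noise-prediction loss; a flow-matching model would substitute its own loss.

```latex
\begin{align}
\ell(\theta; c, x) &= \mathbb{E}_{t,\epsilon}\,
  \bigl\| \epsilon - \epsilon_\theta(x_t, t, c) \bigr\|_2^2 \\
\mathcal{L}(\theta_S) &= -\lambda\,\ell(\theta; c^\ast, x^\ast)
  + \frac{1}{K}\sum_{k=1}^{K} \ell(\theta; c^\ast, x_k)
\end{align}
```

Minimizing $\mathcal{L}$ over $\theta_S$ performs gradient ascent on the target term and gradient descent on the distractor terms, all under the same prompt $c^\ast$, which matches the "competition under the same prompt context" described in the abstract.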

Core claim

SurgUn instantiates retroactive concept interference via distractor-conditioned gradient competition, where target-gradient ascent weakens target-conditioned denoising while descent over semantically diverse distractors introduces competing non-target trajectories. Pixel-grounded weight-space localization selects attention blocks according to generated-image erase-retain behavior to limit collateral damage through shared pathways. Across UnlearnCanvas, IP-character erasure, Holistic Unlearning, EraseBench, and Ring-A-Bell on Stable Diffusion v1.5, SDXL, and SANA-1.5, this yields a stronger erase-retain balance than prior methods, with ablations confirming that diverse distractors, contrastive competition, and localization are all necessary.

What carries the argument

Distractor-conditioned gradient competition paired with pixel-grounded attention-block localization, where the first redistributes outputs across non-target modes and the second selects update sites by testing image-level erase versus retain outcomes.
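To make the competition mechanism concrete, here is a toy sketch in NumPy. It is not the paper's implementation: a linear map `W` stands in for the denoiser, a squared error stands in for the denoising loss, and the names `competition_grad`, `run`, and `lam` are hypothetical. The weight `lam` is kept below 1 so that this toy objective stays bounded; a real diffusion loss is non-convex and would need its own safeguards.

```python
import numpy as np

def competition_grad(W, c, y_target, distractors, lam):
    """Gradient w.r.t. W of  -lam * L_target + mean_k L_distractor_k,
    where each L = 0.5 * ||W @ c - y||^2 (toy stand-in for a denoising loss)."""
    pred = W @ c
    g_target = np.outer(pred - y_target, c)                 # gradient of the target loss
    g_distr = np.mean([np.outer(pred - y, c) for y in distractors], axis=0)
    return -lam * g_target + g_distr                        # ascent on target, descent on distractors

def run(steps=200, lr=0.1, lam=0.25):
    c = np.array([1.0, 0.0])                                # fixed prompt embedding (hypothetical)
    y_target = np.array([1.0, 1.0])                         # output to be forgotten
    distractors = [np.array([-1.0, 0.0]),
                   np.array([0.0, -1.0]),
                   np.array([-1.0, -1.0])]                  # competing non-target outputs
    W = np.outer(y_target, c)                               # start with W @ c == y_target
    for _ in range(steps):
        W = W - lr * competition_grad(W, c, y_target, distractors, lam)
    return W @ c, y_target, distractors
```

In this toy setting the prediction for the prompt moves away from the target output and settles near the distractor mean, rather than collapsing onto any single distractor.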

If this is right

  • Target concepts become harder to recover via paraphrased, compositional, or adversarial prompts.
  • Capabilities for unrelated and related concepts remain more intact than with direct suppression or anchor-based methods.
  • The same procedure produces consistent gains on Stable Diffusion v1.5, SDXL, and SANA-1.5.
  • All three ingredients—diverse distractors, contrastive gradients, and localization—are required for the reported balance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The interference framing may transfer to unlearning in other generative architectures such as autoregressive image or video models.
  • Automated construction of distractor sets could become a reusable preprocessing step for any selective-forgetting pipeline.
  • The localization diagnostic might be extended to test multiple prompt styles automatically to reduce manual choices.
  • Connections to retroactive interference in cognitive psychology suggest possible cross-field experiments on how competition affects memory-like structures in networks.

Load-bearing premise

A reliable semantically diverse distractor set can be assembled and a quick pixel-grounded test will identify the right attention blocks without prompt-specific biases or heavy per-concept tuning.
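The source does not specify how the distractor set is built. As one hypothetical example of a deterministic procedure that needs no per-concept curation, the sketch below does greedy farthest-point selection over concept embeddings, after dropping candidates that are too similar to the target. The function name, the `max_target_sim` cutoff, and the use of cosine similarity are all assumptions made for illustration.

```python
import numpy as np

def select_distractors(target, candidates, k, max_target_sim=0.5):
    """Greedy farthest-point selection on cosine similarity.
    target: (d,) embedding; candidates: (n, d) embeddings.
    Returns a list of candidate indices in selection order."""
    t = target / np.linalg.norm(target)
    C = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    # Drop candidates that sit too close to the target concept.
    allowed = [i for i in range(len(C)) if C[i] @ t <= max_target_sim]
    if not allowed:
        return []
    # Seed with the candidate least similar to the target.
    chosen = [min(allowed, key=lambda i: C[i] @ t)]
    while len(chosen) < min(k, len(allowed)):
        rest = [i for i in allowed if i not in chosen]
        # Add the candidate whose closest already-chosen neighbour is farthest away.
        nxt = min(rest, key=lambda i: max(C[i] @ C[j] for j in chosen))
        chosen.append(nxt)
    return chosen
```

Given a fixed candidate pool and fixed embeddings, the output is fully determined, which is the property the premise requires; whether such a set is diverse enough in practice is an empirical question.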

What would settle it

Apply SurgUn to a held-out concept set where distractors are scarce or where new adversarial prompts still recover the target after unlearning; if recoverability remains high, the competition mechanism has not fully solved the problem.
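The test described above comes down to measuring how often the erased concept reappears for each family of prompts. A minimal sketch of that bookkeeping follows, assuming some external detector has already labelled each generated image; the function name and the record format are hypothetical.

```python
from collections import defaultdict

def recovery_rates(records):
    """records: iterable of (prompt_family, target_detected) pairs, where
    target_detected is True if the erased concept appears in the generated image.
    Returns {family: fraction of generations in which the target was recovered}."""
    hits, totals = defaultdict(int), defaultdict(int)
    for family, detected in records:
        totals[family] += 1
        hits[family] += int(bool(detected))
    return {family: hits[family] / totals[family] for family in totals}
```

A high rate for held-out adversarial or paraphrased prompts after unlearning would count against the competition mechanism.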

Original abstract

Deployed text-to-image diffusion models increasingly require post-hoc concept unlearning for copyright claims, artist opt-outs, safety updates, and protected-content mitigation without full retraining. A central challenge is erase-retain imbalance: aggressive updates suppress targets but damage shared capabilities, while conservative or anchor-based updates preserve quality yet leave concepts recoverable through related, compositional, paraphrased, or adversarial prompts. Inspired by retroactive interference, we propose SurgUn, which treats forgetting as controlled competition rather than direct deletion or one-to-one reassignment. SurgUn instantiates retroactive concept interference via distractor-conditioned gradient competition: target-gradient ascent weakens target-conditioned denoising or flow-matching behavior, while descent over a semantically diverse distractor set introduces competing non-target trajectories under the same prompt context. This redistributes outputs across multiple non-target modes instead of collapsing to a single proxy. To limit collateral forgetting through shared pathways, SurgUn adds pixel-grounded weight-space localization, a lightweight diagnostic that selects attention blocks by generated-image erase-retain behavior, exploiting the asymmetry that suppression is broadly achievable whereas retention is block-selective. Across UnlearnCanvas, IP-character erasure, Holistic Unlearning, EraseBench, and Ring-A-Bell on Stable Diffusion v1.5, SDXL, and SANA-1.5, SurgUn achieves a stronger erase-retain balance than baselines. Ablations show that diverse distractors, contrastive competition, and localization are all necessary for robust suppression while preserving related and unrelated concepts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes SurgUn for post-hoc concept unlearning in text-to-image diffusion models. It reframes forgetting as retroactive interference via distractor-conditioned gradient competition (target ascent paired with descent on a semantically diverse distractor set) plus pixel-grounded localization that selects attention blocks by generated-image erase-retain behavior. The central claim is a superior erase-retain frontier versus baselines on UnlearnCanvas, IP-character erasure, Holistic Unlearning, EraseBench, and Ring-A-Bell across Stable Diffusion v1.5, SDXL, and SANA-1.5, with ablations showing necessity of the three components.

Significance. If the reported gains are reproducible without per-concept curation, the work supplies a useful new perspective on unlearning that redistributes probability mass across multiple non-target modes rather than collapsing to a single proxy. The multi-model, multi-benchmark evaluation and explicit ablations on distractor diversity, contrastive competition, and localization constitute clear strengths that would be retained even if the localization diagnostic requires refinement.

major comments (2)
  1. [§3.2] §3.2 (Distractor-conditioned competition): The construction of the 'semantically diverse distractor set' is listed as a free parameter with no explicit algorithm, selection criteria, or pseudocode. Without this, it is impossible to verify that the procedure is automatic and free of domain knowledge or prompt engineering, which is load-bearing for the claim that gains on EraseBench and Ring-A-Bell are not artifacts of careful curation.
  2. [§4.3, Table 5] §4.3 and Table 5 (Localization diagnostic): The pixel-grounded ranking of attention blocks is evaluated only on the same images and prompts used to compute the diagnostic. No cross-prompt or cross-composition stability test is reported, leaving open the possibility that block selection overfits to the diagnostic distribution and fails to generalize to the compositional or adversarial prompts emphasized in the abstract.
minor comments (2)
  1. [Abstract] The abstract asserts a 'stronger erase-retain balance' than baselines without any numerical values, effect sizes, or error bars; adding at least the key metrics from the main tables would improve readability.
  2. [§3.1] Notation for the combined gradient update (target ascent + distractor descent) is described in prose; an explicit equation would clarify the contrastive competition term.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and will revise the manuscript to improve clarity and add supporting experiments where needed.

point-by-point responses
  1. Referee: [§3.2] §3.2 (Distractor-conditioned competition): The construction of the 'semantically diverse distractor set' is listed as a free parameter with no explicit algorithm, selection criteria, or pseudocode. Without this, it is impossible to verify that the procedure is automatic and free of domain knowledge or prompt engineering, which is load-bearing for the claim that gains on EraseBench and Ring-A-Bell are not artifacts of careful curation.

    Authors: We agree that the current description of the distractor set construction in §3.2 is insufficiently explicit. The manuscript presents it as a free parameter, and we will revise this section to include a precise algorithm, selection criteria, and pseudocode. The updated description will demonstrate that the set is generated automatically from a broad, fixed collection of non-target concepts using a deterministic procedure that requires no per-concept curation, domain knowledge, or prompt engineering. This revision will directly support the reproducibility of results on EraseBench and Ring-A-Bell.
    revision: yes

  2. Referee: [§4.3, Table 5] §4.3 and Table 5 (Localization diagnostic): The pixel-grounded ranking of attention blocks is evaluated only on the same images and prompts used to compute the diagnostic. No cross-prompt or cross-composition stability test is reported, leaving open the possibility that block selection overfits to the diagnostic distribution and fails to generalize to the compositional or adversarial prompts emphasized in the abstract.

    Authors: We acknowledge that the manuscript does not report cross-prompt or cross-composition stability tests for the localization diagnostic in §4.3 and Table 5. While the diagnostic exploits a general asymmetry between broad suppression and block-selective retention observed in generated images, we agree that explicit stability evaluation would strengthen the claim of generalizability to compositional and adversarial prompts. We will add such tests in the revised version, applying the selected blocks to held-out prompts and compositions drawn from the evaluation benchmarks.
    revision: yes
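One simple form the cross-prompt stability test could take is to compare the top-ranked blocks chosen from the diagnostic prompts with those chosen from held-out prompts. The sketch below uses Jaccard overlap of the top-k sets; the metric, the function name, and the score dictionaries are illustrative assumptions, not something the manuscript or the rebuttal specifies.

```python
def topk_jaccard(scores_a, scores_b, k):
    """Jaccard overlap between the top-k blocks of two rankings.
    scores_*: dict mapping block name -> score (higher means more preferred).
    Ties are broken by block name so the result is deterministic."""
    if k < 1:
        raise ValueError("k must be at least 1")

    def top(scores):
        return set(sorted(scores, key=lambda b: (-scores[b], b))[:k])

    a, b = top(scores_a), top(scores_b)
    return len(a & b) / len(a | b)
```

An overlap near 1 across prompt families would suggest the block selection is not tied to the diagnostic distribution; a low overlap would support the referee's overfitting concern.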

Circularity Check

0 steps flagged

No significant circularity; new procedural elements evaluated externally

full rationale

The paper introduces SurgUn as a novel unlearning approach via distractor-conditioned gradient competition and pixel-grounded localization, explicitly framed as inspired by retroactive interference rather than derived from prior fitted quantities or self-citations. No equations or derivations in the provided text reduce target results to inputs by construction. Performance claims rest on evaluations across external benchmarks (UnlearnCanvas, EraseBench, Ring-A-Bell) and ablations showing component necessity, which constitute independent validation rather than self-referential fitting or renaming. The central erase-retain balance is not forced by definition or internal parameters but tested against baselines on held-out tasks. This qualifies as a self-contained method paper with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that retroactive interference can be instantiated via gradient competition in diffusion models and that attention blocks exhibit selective erase-retain behavior.

free parameters (2)
  • distractor set construction
    Selection of semantically diverse distractors is central to the competition mechanism and may involve heuristic or manual choices not fully specified.
  • localization diagnostic criteria
    The pixel-grounded selection of attention blocks depends on generated-image behavior thresholds that are likely tuned per experiment.
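To show where such thresholds would enter, here is a hypothetical sketch of a block-selection rule. It assumes each candidate attention block has already been scored for erasure and retention on generated images, with both scores in [0, 1]; the threshold values, the function name, and the ranking rule are assumptions made for illustration only.

```python
def select_blocks(block_scores, erase_min=0.9, retain_min=0.8):
    """block_scores: dict mapping block name -> (erase_score, retain_score).
    Keeps blocks that meet both thresholds and ranks them by retention first,
    then erasure, then name (for deterministic ordering)."""
    passing = {
        name: scores
        for name, scores in block_scores.items()
        if scores[0] >= erase_min and scores[1] >= retain_min
    }
    return sorted(passing, key=lambda n: (-passing[n][1], -passing[n][0], n))
```

Ranking by retention first reflects the stated asymmetry that suppression is broadly achievable while retention is block-selective; both cutoffs are exactly the kind of per-experiment choice this ledger entry flags.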
axioms (1)
  • domain assumption: Retroactive interference from distractors can redistribute denoising trajectories away from target concepts.
    Invoked in the description of distractor-conditioned gradient competition.

pith-pipeline@v0.9.0 · 5806 in / 1233 out tokens · 52160 ms · 2026-05-21T12:23:29.246950+00:00 · methodology
