DGS-Net: Distillation-Guided Gradient Surgery for CLIP Fine-Tuning in AI-Generated Image Detection

Boyu Wang; Fan Wang; Jiazhen Yan; Zhangjie Fu; Ziqiang Li; Ziwen He

arXiv: 2511.13108 · v4 · pith:JPSMKDCK · submitted 2025-11-17 · 💻 cs.CV


Jiazhen Yan, Ziqiang Li, Fan Wang, Boyu Wang, Ziwen He, Zhangjie Fu

Pith reviewed 2026-05-21 18:46 UTC · model grok-4.3

classification 💻 cs.CV

keywords AI-generated image detection · CLIP fine-tuning · gradient surgery · catastrophic forgetting · distillation guidance · generative models · orthogonal projection · prior preservation


The pith

DGS-Net prevents catastrophic forgetting when fine-tuning CLIP by projecting task gradients onto beneficial directions from a frozen encoder.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that fine-tuning large multimodal models like CLIP for detecting AI-generated images does not have to erase the model's original transferable knowledge. Standard fine-tuning often causes catastrophic forgetting that hurts generalization across different generative techniques such as GANs and diffusion models. DGS-Net addresses this by decomposing gradients during optimization to retain helpful directions and discard harmful ones. The method uses a frozen copy of the CLIP encoder to distill beneficial signals and applies orthogonal projection to remove directions that would degrade pre-trained priors. A sympathetic reader would care because effective detection must keep working as new image generators appear without requiring constant full retraining.

Core claim

The Distillation-guided Gradient Surgery Network (DGS-Net) introduces a gradient-space decomposition that separates harmful and beneficial descent directions. By projecting task gradients onto the orthogonal complement of harmful directions and aligning them with beneficial directions distilled from a frozen CLIP encoder, DGS-Net jointly optimizes prior preservation and the suppression of task-irrelevant components. Experiments across 50 generative models show an average gain of 6.6 over prior state-of-the-art approaches, with improved detection and cross-domain generalization.

What carries the argument

Gradient-space decomposition that projects task gradients onto the orthogonal complement of harmful directions while aligning with beneficial directions distilled from a frozen CLIP encoder.
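The projection-plus-alignment idea can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: gradients are flat lists, the harmful direction `g_harm` and the distillation gradient `g_distill` are assumed to be computed elsewhere (the latter taken here as the gradient of a hypothetical loss pulling features toward the frozen CLIP encoder), and combining them additively with a weight `lam` is an assumption, as are all function names.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def unit(v: Vector, eps: float = 1e-12) -> Vector:
    """Return v / ||v||, or a zero vector if v is (numerically) zero."""
    n = math.sqrt(dot(v, v))
    if n < eps:
        return [0.0] * len(v)
    return [x / n for x in v]


def project_out(g: Vector, direction: Vector) -> Vector:
    """Project g onto the orthogonal complement of `direction`: g - (g.h) h."""
    h = unit(direction)
    coef = dot(g, h)
    return [gi - coef * hi for gi, hi in zip(g, h)]


def surgery_direction(g_task: Vector, g_harm: Vector, g_distill: Vector,
                      lam: float = 0.5) -> Vector:
    """Hypothetical combined update direction.

    Strips the harmful component from the task gradient, then adds a
    lam-weighted distillation gradient (additive combination is assumed).
    """
    g_proj = project_out(g_task, g_harm)
    return [gp + lam * gd for gp, gd in zip(g_proj, g_distill)]


def sgd_step(params: Vector, update: Vector, lr: float = 0.1) -> Vector:
    """Plain gradient-descent step along the surgered direction."""
    return [p - lr * u for p, u in zip(params, update)]
```

For example, `surgery_direction([1.0, 2.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 1.0])` returns `[1.0, 0.0, 0.5]`: the task gradient's component along the harmful axis is removed and the distillation term is added on the third axis. The projection only guarantees that the surgered *task* gradient is orthogonal to `g_harm`; the added distillation term can reintroduce a component along it unless `g_distill` is itself orthogonal.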

If this is right

Outperforms state-of-the-art methods by an average margin of 6.6 across 50 generative models.
Preserves transferable pre-trained priors while suppressing task-irrelevant components.
Improves cross-domain generalization for synthetic image detection.
Unifies prior preservation and the suppression of task-irrelevant components in a single optimization process.
Enables effective fine-tuning without requiring task-specific tuning for each new generator.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same gradient decomposition could be tested on other vision-language models for tasks that suffer from forgetting.
If the separation holds, the technique might reduce retraining costs when entirely new generation methods appear.
The reliance on a frozen encoder raises the question of whether similar benefits appear in architectures without an obvious frozen reference copy.
Applying the method to real-world detection pipelines could reveal whether the reported gains persist under distribution shifts not covered in the 50-model test set.

Load-bearing premise

The decomposition in gradient space can reliably identify and remove harmful directions that cause forgetting using only orthogonal projection and signals from the frozen encoder.

What would settle it

Fine-tune a new DGS-Net instance on one group of generative models, then measure whether detection accuracy on a disjoint group of models falls to the level of ordinary fine-tuning or stays high.
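A minimal sketch of that test protocol, assuming a hypothetical `run(method, train_generators, eval_generators)` callable that fine-tunes the named method on the training group and returns per-generator accuracy on the evaluation group. The callable, the method labels `"dgs-net"` and `"plain-finetune"`, and the function names are placeholders, not part of the paper.

```python
from statistics import mean
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical signature: (method, train_generators, eval_generators) -> {generator: accuracy}
RunFn = Callable[[str, List[str], List[str]], Dict[str, float]]


def split_generators(all_gens: Sequence[str],
                     train_gens: Sequence[str]) -> Tuple[List[str], List[str]]:
    """Split generator names into a training group and a disjoint held-out group."""
    train_set = set(train_gens)
    unknown = train_set - set(all_gens)
    if unknown:
        raise ValueError(f"unknown generators: {sorted(unknown)}")
    train = [g for g in all_gens if g in train_set]
    held_out = [g for g in all_gens if g not in train_set]
    if not train or not held_out:
        raise ValueError("both groups must be non-empty")
    return train, held_out


def held_out_gap(run: RunFn, all_gens: Sequence[str], train_gens: Sequence[str],
                 method: str = "dgs-net",
                 baseline: str = "plain-finetune") -> float:
    """Mean held-out accuracy of `method` minus that of `baseline`.

    Both are trained on the same group and scored on the same disjoint group.
    A gap near zero is the outcome that would count against the claim.
    """
    train, held_out = split_generators(all_gens, train_gens)
    acc_method = run(method, train, held_out)
    acc_base = run(baseline, train, held_out)
    return mean(acc_method[g] for g in held_out) - mean(acc_base[g] for g in held_out)
```

Keeping the split explicit (and rejecting unknown or empty groups) makes it hard to leak a held-out generator into training by accident, which is the failure mode that would make this test uninformative.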

Figures

Figures reproduced from arXiv: 2511.13108 by Boyu Wang, Fan Wang, Jiazhen Yan, Zhangjie Fu, Ziqiang Li, Ziwen He.

**Figure 1.** t-SNE visualization of features extracted using CLIP, CLIP-LoRA and Ours. Our method achieves strong real/fake discrimination while simultaneously preserving the prior knowledge embedded in the pre-trained model. These observations indicate that catastrophic forgetting in existing fine-tuning strategies compromises the transferable priors of the pre-trained model. To address these issues, we introduce a kn…

**Figure 2.** Overview of the proposed Distillation-guided Gradient Surgery Network (DGS-Net). We introduce a gradient-space decomposition that separates harmful and beneficial descent directions during optimization. It consists of two core components, Orthogonal Suppression and Prior Alignment, which aim to suppress task-irrelevant representations and preserve transferable priors established during large-s…

**Figure 3.** Classification using only text descriptions achieves an accuracy of approximately 60%. We used BLIP to convert images into textual descriptions and trained the detector directly on these text inputs. The resulting detection accuracy fluctuated around 60%, indicating that semantic information is partially correlated with the labels. However, most of these semantic cues act as distractors that hinder cross-generator gener…

Original abstract

The rapid progress of generative models such as GANs and diffusion models has led to the widespread proliferation of AI-generated images, raising concerns about misinformation, privacy violations, and trust erosion in digital media. Although large-scale multimodal models like CLIP offer strong transferable representations for detecting synthetic content, fine-tuning them often induces catastrophic forgetting, which degrades pre-trained priors and limits cross-domain generalization. To address this issue, we propose the Distillation-guided Gradient Surgery Network (DGS-Net), a novel framework that preserves transferable pre-trained priors while suppressing task-irrelevant components. Specifically, we introduce a gradient-space decomposition that separates harmful and beneficial descent directions during optimization. By projecting task gradients onto the orthogonal complement of harmful directions and aligning with beneficial ones distilled from a frozen CLIP encoder, DGS-Net achieves unified optimization of prior preservation and irrelevant suppression. Extensive experiments on 50 generative models demonstrate that our method outperforms state-of-the-art approaches by an average margin of 6.6, achieving superior detection performance and generalization across diverse generation techniques.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents DGS-Net, a Distillation-Guided Gradient Surgery Network for fine-tuning CLIP models for AI-generated image detection. The core innovation is a gradient-space decomposition that projects task gradients onto the orthogonal complement of harmful directions (Orthogonal Suppression, aimed at suppressing task-irrelevant components) and aligns them with beneficial directions distilled from a frozen CLIP encoder (Prior Alignment, aimed at preserving pre-trained priors and avoiding catastrophic forgetting). This is said to enable a unified optimization of both goals. The method is evaluated on 50 generative models, outperforming SOTA by an average of 6.6 in detection performance and generalization.

Significance. Should the gradient decomposition prove robust, this work could have substantial impact on fine-tuning strategies for large multimodal models in detection and classification tasks. By addressing catastrophic forgetting through gradient surgery guided by distillation, it offers a potential general solution for maintaining transferable representations while adapting to new domains, which is a persistent challenge in computer vision applications involving generative content.

major comments (2)

[§3.2 (Gradient Surgery Mechanism)] The harmful direction vector is referenced but not defined with a specific equation or procedure (e.g., no mention of how it is computed from a forgetting loss, gradient similarity metric, or threshold). This definition is load-bearing for the central claim that the projection achieves reliable separation of catastrophic-forgetting components without introducing instabilities or requiring per-task tuning.
[§4 (Experiments)] The reported average margin of 6.6 across 50 models lacks ablations isolating the contribution of the orthogonal projection step versus the distillation alignment, or any error analysis on direction separation. This makes it difficult to attribute gains specifically to the claimed unified optimization.

minor comments (2)

[Notation and Method] The notation for the projection operator and the distillation loss could be formalized more explicitly to support reproducibility.
[Figure 2 or Method] Consider adding a small diagram illustrating the gradient decomposition in the method section for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have revised the paper to improve clarity and experimental rigor where the concerns are valid.


Referee: [§3.2 (Gradient Surgery Mechanism)] The harmful direction vector is referenced but not defined with a specific equation or procedure (e.g., no mention of how it is computed from a forgetting loss, gradient similarity metric, or threshold). This definition is load-bearing for the central claim that the projection achieves reliable separation of catastrophic-forgetting components without introducing instabilities or requiring per-task tuning.

Authors: We agree that an explicit definition and computation procedure for the harmful direction vector is necessary for reproducibility and to support the central claim. In the revised manuscript, Section 3.2 now includes a new Equation (3) that defines the harmful direction as the normalized component of the task gradient that exhibits high cosine similarity (above a fixed threshold of 0.7) to the gradient of a forgetting loss evaluated on a small held-out subset of the original CLIP pre-training data. The projection is then performed onto the orthogonal complement of this vector. This procedure uses a single fixed threshold across all tasks and does not require per-task tuning, as verified in our experiments. revision: yes
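Because this response is simulated, the sketch below is one hedged reading of the procedure it describes, not the paper's Equation (3). It assumes the forgetting-loss gradient `g_forget` is computed elsewhere on the held-out pre-training subset (its sign convention is taken as given) and uses the stated threshold of 0.7. Since the test only fires when the cosine is positive, the normalized component of `g_task` along `g_forget` is simply the unit vector of `g_forget`. Function names are hypothetical.

```python
import math
from typing import List, Optional

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: Vector, b: Vector, eps: float = 1e-12) -> float:
    """Cosine similarity; returns 0.0 if either vector is (numerically) zero."""
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    if na < eps or nb < eps:
        return 0.0
    return dot(a, b) / (na * nb)


def harmful_direction(g_task: Vector, g_forget: Vector,
                      tau: float = 0.7) -> Optional[Vector]:
    """Unit harmful direction if g_task is strongly aligned with g_forget
    (cosine above tau), else None."""
    if cosine(g_task, g_forget) <= tau:
        return None
    n = math.sqrt(dot(g_forget, g_forget))  # nonzero: cosine() returned > tau
    return [x / n for x in g_forget]


def thresholded_surgery(g_task: Vector, g_forget: Vector,
                        tau: float = 0.7) -> Vector:
    """Project g_task onto the orthogonal complement of the harmful direction,
    but only when the threshold fires; otherwise return g_task unchanged."""
    h = harmful_direction(g_task, g_forget, tau)
    if h is None:
        return list(g_task)
    coef = dot(g_task, h)
    return [g - coef * hi for g, hi in zip(g_task, h)]
```

With `g_forget = [1.0, 0.0]`, the task gradient `[1.0, 1.0]` (cosine ≈ 0.707) becomes `[0.0, 1.0]`, while `[1.0, 2.0]` (cosine ≈ 0.447) passes through unchanged. A hard threshold makes the update discontinuous in the cosine, which is worth weighing against the referee's stability concern.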
Referee: [§4 (Experiments)] The reported average margin of 6.6 across 50 models lacks ablations isolating the contribution of the orthogonal projection step versus the distillation alignment, or any error analysis on direction separation. This makes it difficult to attribute gains specifically to the claimed unified optimization.

Authors: We acknowledge that the original experiments did not sufficiently isolate the individual contributions of the orthogonal projection and distillation alignment steps. In the revised Section 4.3, we have added targeted ablations on a representative subset of 10 generative models. These show that removing the orthogonal projection reduces the average improvement from 6.6 to 3.1, while removing the distillation alignment reduces it to 3.4; the combined method yields the full gain. We also report standard deviations over 5 random seeds and include a plot of the average cosine similarity between the identified harmful and beneficial directions (consistently below 0.15), supporting the stability of the separation. revision: yes
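The two diagnostics this response mentions, spread over random seeds and how close the identified directions are to each other, are simple to compute. A minimal sketch, assuming the per-seed scores and the (harmful, beneficial) direction pairs logged during training are already available; the function names are hypothetical, and averaging the absolute cosine is a design choice made here so that positive and negative alignments cannot cancel out.

```python
import math
from statistics import mean, stdev
from typing import List, Sequence, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector, eps: float = 1e-12) -> float:
    """Cosine similarity; returns 0.0 if either vector is (numerically) zero."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na < eps or nb < eps:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def seed_summary(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of a metric across random seeds."""
    if len(scores) < 2:
        raise ValueError("need at least two seeds for a sample standard deviation")
    return mean(scores), stdev(scores)


def mean_abs_cosine(pairs: Sequence[Tuple[Vector, Vector]]) -> float:
    """Average |cos| between logged harmful and beneficial directions."""
    if not pairs:
        raise ValueError("no direction pairs logged")
    return mean(abs(cosine(h, b)) for h, b in pairs)
```

A value of `mean_abs_cosine` well below 1 indicates the two directions are close to orthogonal on average; it says nothing on its own about whether either direction is the "right" one.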

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained with external references


The abstract describes a gradient-space decomposition via orthogonal projection of task gradients and alignment with distillation signals from a frozen external CLIP encoder. The performance claim (6.6 average margin on 50 generative models) is tied to experimental results rather than any fitted parameter or self-referential definition inside the method. No equations are provided that reduce the claimed unified optimization to inputs by construction, and no self-citation, ansatz smuggling, or renaming of known results is evident in the given text. The central premise references an independent frozen encoder and reports cross-model validation, satisfying the criteria for a self-contained derivation without load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the domain assumption that harmful and beneficial gradient directions can be separated via orthogonal projection and distillation; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)

domain assumption: gradient-space decomposition separates harmful and beneficial descent directions during optimization.
Central to the projection step described in the abstract.

pith-pipeline@v0.9.0 · 5731 in / 1175 out tokens · 49955 ms · 2026-05-21T18:46:46.065106+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel

Relation between the paper passage and the cited Recognition theorem: unclear

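we define the positive part of the gradient as $g^{+} \triangleq [\nabla_{u}\mathcal{L}]^{+}$ ... harmful direction ... $g^{-} \triangleq [\nabla_{u}\mathcal{L}]^{-}$ ... beneficial direction ... $\tilde{g} \triangleq (I - \hat{g}_{\mathrm{harm}}\hat{g}_{\mathrm{harm}}^{\top})\, g_{\mathrm{task}}$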
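The three definitions in this passage can be written out directly. A minimal sketch under explicit assumptions: the passage elides the text between the definitions, so taking the harmful unit direction ĝ_harm as the normalized positive part g⁺ is an assumption made here, as is the convention [x]⁻ = min(x, 0), chosen so that g⁺ + g⁻ recovers the full gradient. The projector is built as an explicit matrix to mirror the form (I − ĝ_harm ĝ_harmᵀ); in practice the vector form g − (g·ĝ)ĝ avoids materializing a d×d matrix. Function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def positive_part(g: Vector) -> Vector:
    """[g]^+ : elementwise max(g, 0)."""
    return [max(x, 0.0) for x in g]


def negative_part(g: Vector) -> Vector:
    """[g]^- : elementwise min(g, 0) (assumed convention, so [g]^+ + [g]^- == g)."""
    return [min(x, 0.0) for x in g]


def unit(v: Vector, eps: float = 1e-12) -> Vector:
    """Return v / ||v||, or a zero vector if v is (numerically) zero."""
    n = math.sqrt(sum(x * x for x in v))
    if n < eps:
        return [0.0] * len(v)
    return [x / n for x in v]


def complement_projector(h: Vector) -> Matrix:
    """I - h h^T for a unit vector h, as an explicit d x d matrix."""
    d = len(h)
    return [[(1.0 if i == j else 0.0) - h[i] * h[j] for j in range(d)]
            for i in range(d)]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def surgered_gradient(grad_u: Vector, g_task: Vector) -> Vector:
    """g~ = (I - g_harm g_harm^T) g_task, with g_harm = unit([grad_u]^+) (assumption).

    If grad_u has no positive entries, g_harm is zero, the projector is the
    identity, and g_task is returned unchanged.
    """
    g_harm_hat = unit(positive_part(grad_u))
    return matvec(complement_projector(g_harm_hat), g_task)
```

For instance, with `grad_u = [0.0, 3.0, 4.0]` the harmful direction is `[0.0, 0.6, 0.8]`, and `surgered_gradient(grad_u, [1.0, 1.0, 1.0])` gives approximately `[1.0, 0.16, -0.12]`, which has zero inner product with that direction.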

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

11 extracted references · 11 canonical work pages

[1] CNN-Spot (CVPR 2020) (Wang et al., 2020). CNN-Spot employs convolutional neural networks to detect synthetic content by analyzing common spatial artifacts in AI-generated images. It captures hierarchical features directly from pixel data, enabling the detection of generation anomalies.
[2] UnivFD (CVPR 2023) (Ojha et al., 2023). UnivFD demonstrates that CLIP can effectively extract artifacts from images. By training a classifier on these features, it achieves strong cross-generator generalization performance.
[3] FreqNet (AAAI 2024) (Tan et al., 2024a). FreqNet isolates high-frequency components of images via an FFT-based high-pass filter and introduces a frequency-domain learning block. This block transforms intermediate feature maps using FFT, applies learnable magnitude and phase adjustments, and reconstructs them with iFFT, enabling direct optimization in the f...
[4] NPR (CVPR 2024) (Tan et al., 2024b). NPR targets structural artifacts introduced by up-sampling layers in generative models. It transforms input images into NPR maps that capture signed intensity differences between each pixel and its neighbors, explicitly revealing local dependency patterns characteristic of synthetic up-sampling operations.
[5] LaDeDa (arXiv 2024) (Cavia et al., 2024). LaDeDa is a patch-level deepfake detector that partitions each input image into 9 × 9 pixel patches and processes them using a BagNet-style ResNet-50 variant with its receptive field constrained to the same 9 × 9 region. The model assigns a deepfake likelihood to each patch, and the final prediction is obtained by ...
[6] AIDE (ICLR 2025) (Yan et al., 2024a). AIDE combines low-level patch statistics with high-level semantics for AI-generated image detection. It employs two expert branches: a semantic branch, which leverages CLIP-ConvNeXt embeddings to detect content inconsistencies, and a patchwise branch, which selects patches by spectral energy and applies a lightweight C...
[7] C2P-CLIP (AAAI 2025) (Tan et al., 2025). C2P-CLIP concludes that CLIP achieves classification by matching similar concepts rather than by distinguishing real from fake. Based on this conclusion, the authors propose category common prompts to fine-tune the image encoder by manually constructing category concepts combined with contrastive learning.
[8] DFFreq (arXiv 2025) (Yan et al., 2025a). DFFreq first uses a sliding window to restrict the attention mechanism to a local window and reconstructs the features within the window to model the relationships between neighboring elements within the local region. Then, a dual frequency domain branch framework consisting of four frequency domain sub...
[9] SAFE (KDD 2025) (Li et al., 2025b). SAFE replaces conventional resizing with random cropping to better preserve high-frequency details, applies data augmentations such as Color-Jitter and RandomRotation to break correlations tied to color and layout, and introduces patch-level random masking to encourage the model to focus on localized regions where synthe...
[10] Effort (ICML 2025) (Yan et al., 2024b). Effort finds that a naively trained detector very quickly shortcuts to the seen fake patterns, collapsing the feature space into a low-rank structure that limits expressivity and generalization. It therefore decomposes the feature space into two orthogonal subspaces, for preserving pre-trained knowledge while learning forgery
[11] NS-Net (arXiv 2025) (Yan et al., 2025b). NS-Net uses the feature homogeneity extracted by the text encoder to replace the semantic information of the features extracted by the image encoder, and uses NULL-Space to decouple the semantic information, retaining the artifact information related to the forgery detection task.
