Fragile Reconstruction: Adversarial Vulnerability of Reconstruction-Based Detectors for Diffusion-Generated Images
Pith reviewed 2026-05-10 14:46 UTC · model grok-4.3
The pith
Imperceptible adversarial perturbations can collapse the accuracy of reconstruction-based detectors for diffusion-generated images to near zero.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Reconstruction-based detectors for diffusion-generated images exhibit severe security vulnerabilities: adding imperceptible adversarial perturbations to input images causes detection accuracy to collapse to near zero across three representative detectors and four generative backbones. The attacks succeed in white-box scenarios, transfer between detectors to enable black-box attacks, and resist standard countermeasures, which the authors link to the low signal-to-noise ratio of the attacked samples as seen by the detectors.
What carries the argument
Adversarial perturbation crafting that targets the reconstruction step and exploits the resulting low signal-to-noise ratio perceived by the detector.
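To make the mechanism concrete, here is a minimal sketch of an L-infinity-bounded projected gradient ascent on a reconstruction-error score. Everything in it is an assumption made for illustration: the "reconstructor" is a toy linear projection rather than a diffusion model or autoencoder, the names make_projection, recon_error, and pgd_linf are hypothetical, and the budget eps is arbitrary. It is not the paper's attack, only one instance of the general idea that a small bounded perturbation can move the score a detector thresholds on.

```python
import numpy as np

def make_projection(dim, rank, seed=0):
    """Toy 'reconstructor': orthogonal projection onto a random rank-dim subspace."""
    rng = np.random.default_rng(seed)
    basis, _ = np.linalg.qr(rng.normal(size=(dim, rank)))
    return basis @ basis.T

def recon_error(x, proj):
    """Squared reconstruction error ||x - P x||^2 used as the detection score."""
    residual = x - proj @ x
    return float(residual @ residual)

def pgd_linf(x, proj, eps=0.03, step=0.01, iters=20):
    """Sign-gradient ascent on the score, projected back into the eps-ball around x."""
    x_adv = x.copy()
    for _ in range(iters):
        # For an orthogonal projection P, the gradient of ||(I - P) x||^2 is 2 (I - P) x.
        grad = 2.0 * (x_adv - proj @ x_adv)
        x_adv = x_adv + step * np.sign(grad)
        x_adv = np.clip(x_adv, x - eps, x + eps)  # enforce the L-infinity budget
    return x_adv

# A sample that reconstructs well (as a generated image would in this toy setup).
proj = make_projection(dim=64, rank=8)
rng = np.random.default_rng(1)
x = proj @ rng.normal(size=64) + 0.01 * rng.normal(size=64)
x_adv = pgd_linf(x, proj)
```

Comparing recon_error(x, proj) with recon_error(x_adv, proj) shows the score rising while no coordinate moves by more than eps. A real pipeline would also clip to the valid pixel range and differentiate through the actual reconstruction model, both of which are omitted here.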
If this is right
- All evaluated detectors lose effectiveness under white-box adversarial attacks.
- Attacks transfer across detectors, enabling construction in black-box settings.
- Standard adversarial defenses give only limited protection.
- The low signal-to-noise ratio of attacked samples explains why current detectors fail.
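One way to organize the white-box and transfer claims in the list above is a source-by-target accuracy matrix: the diagonal holds white-box accuracy and the off-diagonal entries hold transfer (black-box) accuracy. The sketch below assumes each detector is a callable returning 0/1 predictions. The function name transfer_matrix, the detector names, and the toy inputs are all hypothetical placeholders, not detectors or data from the paper.

```python
import numpy as np

def transfer_matrix(detectors, adversarial_sets, labels):
    """Accuracy of every target detector on examples crafted against every source.

    detectors: dict name -> callable mapping a batch (n, d) to 0/1 predictions
    adversarial_sets: dict name -> batch crafted against that detector
    labels: ground-truth 0/1 labels shared by all batches
    """
    names = list(detectors)
    matrix = np.zeros((len(names), len(names)))
    for i, source in enumerate(names):
        for j, target in enumerate(names):
            predictions = detectors[target](adversarial_sets[source])
            matrix[i, j] = float(np.mean(predictions == labels))
    return names, matrix

# Toy stand-ins: two threshold rules that label a sample 1 ("fake") or 0 ("real").
detectors = {
    "det_a": lambda batch: (batch.sum(axis=1) < 0).astype(int),
    "det_b": lambda batch: (batch[:, 0] < 0).astype(int),
}
labels = np.array([1, 1])  # both samples are truly fake
adversarial_sets = {
    "det_a": np.array([[1.0, 1.0], [2.0, -1.0]]),
    "det_b": np.array([[0.5, -2.0], [1.0, 0.0]]),
}
names, matrix = transfer_matrix(detectors, adversarial_sets, labels)
```

Row i of the matrix reads as "examples crafted against detector i, evaluated on each detector in turn", so a low off-diagonal value indicates that the attack transfers.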
Where Pith is reading between the lines
- Detection approaches may need to incorporate explicit robustness testing against small input changes rather than relying solely on reconstruction quality.
- If the vulnerability stems from the reconstruction mechanism itself, similar issues could appear in other generative-image detectors that use comparable pipelines.
Load-bearing premise
The three tested detectors stand in for the wider set of reconstruction-based methods and the perturbations stay imperceptible and practical in real conditions.
What would settle it
Measure whether detection accuracy still falls to near zero when the same attack method is applied to a new reconstruction-based detector that uses a different architecture or different training data.
Original abstract
Recently, detecting AI-generated images produced by diffusion-based models has attracted increasing attention due to their potential threat to safety. Among existing approaches, reconstruction-based methods have emerged as a prominent paradigm for this task. However, we find that such methods exhibit severe security vulnerabilities to adversarial perturbations; that is, by adding imperceptible adversarial perturbations to input images, the detection accuracy of classifiers collapses to near zero. To verify this threat, we present a systematic evaluation of the adversarial robustness of three representative detectors across four diverse generative backbone models. First, we construct adversarial attacks in white-box scenarios, which degrade the performance of all well-trained detectors. Moreover, we find that these attacks demonstrate transferability; specifically, attacks crafted against one detector can be transferred to others, indicating that adversarial attacks on detectors can also be constructed in a black-box setting. Finally, we assess common countermeasures and find that standard defense methods against adversarial attacks provide limited mitigation. We attribute these failures to the low signal-to-noise ratio (SNR) of attacked samples as perceived by the detectors. Overall, our results reveal fundamental security limitations of reconstruction-based detectors and highlight the need to rethink existing detection strategies.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that reconstruction-based detectors for diffusion-generated images are severely vulnerable to adversarial perturbations. By adding imperceptible perturbations, detection accuracy collapses to near zero across three representative detectors evaluated on four generative models. White-box attacks succeed, attacks transfer across detectors enabling black-box scenarios, standard defenses offer limited mitigation, and the failures are attributed to low SNR of attacked samples as perceived by the detectors.
Significance. If the empirical results hold, the finding is significant because it identifies a practical security limitation in a prominent detection paradigm for AI-generated content, with timely implications for deployment. The systematic scope—multiple detectors, generators, white-box/transfer/defense tests—provides concrete evidence that could guide future robust designs. The work earns credit for its empirical breadth and falsifiable predictions about attack success rates.
major comments (3)
- [Abstract and §3] Abstract and §3 (Detector Selection): The central claim that 'such methods exhibit severe security vulnerabilities' generalizes from only three detectors. Without explicit criteria for representativeness or evaluation of additional reconstruction-based variants in §3, it remains unclear whether the observed collapse is paradigm-wide or tied to shared architectural motifs in the chosen implementations.
- [§5.2] §5.2 (Transferability Experiments): Transfer success is reported, but without the number of independent runs, standard deviations, or statistical tests on the accuracy drops, the reliability of the black-box transfer claim is difficult to assess and weakens support for the security-vulnerability conclusion.
- [§6] §6 (Defense Assessment): The statement that 'standard defense methods against adversarial attacks provide limited mitigation' is load-bearing for the final recommendation to rethink strategies, yet lacks quantitative before/after metrics or ablation on which defenses were tested and why they failed.
minor comments (2)
- The abstract mentions low SNR but does not define how SNR is computed for the detectors; add a brief equation or procedure in the main text or appendix.
- [Figures] Table captions and axis labels in experimental result figures should explicitly state the generators and attack strengths used for each row/column.
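On the SNR minor comment above: the paper's own procedure is not given on this page, so the following is only the conventional power-ratio definition in decibels, offered as a possible starting point. Treating the detector's clean feature (for example, a reconstruction residual) as the "signal" and the change induced by the perturbation as the "noise" is an assumption here, and snr_db is a hypothetical helper name.

```python
import numpy as np

def snr_db(signal, noise):
    """Conventional SNR in decibels: 10 * log10(mean signal power / mean noise power)."""
    signal_power = np.mean(np.square(signal))
    noise_power = np.mean(np.square(noise))
    return float(10.0 * np.log10(signal_power / noise_power))

# Synthetic check: unit-amplitude signal against noise of amplitude 0.1.
signal = np.ones(100)
noise = 0.1 * np.ones(100)
value = snr_db(signal, noise)  # power ratio of 100, i.e. 20 dB
```

Under this definition a low or negative value would mean the perturbation-induced change dominates the feature the detector relies on.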
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the scope and rigor of our empirical claims. We address each major point below, indicating where revisions will be made to strengthen the manuscript.
-
Referee: [Abstract and §3] Abstract and §3 (Detector Selection): The central claim that 'such methods exhibit severe security vulnerabilities' generalizes from only three detectors. Without explicit criteria for representativeness or evaluation of additional reconstruction-based variants in §3, it remains unclear whether the observed collapse is paradigm-wide or tied to shared architectural motifs in the chosen implementations.
Authors: We agree that the manuscript would benefit from greater transparency on detector selection. In the revised version, we will expand §3 to explicitly state the criteria used: prominence in recent literature, coverage of distinct reconstruction architectures (e.g., autoencoder-based, diffusion-inversion-based, and hybrid), and public availability of trained models. We will also add a short discussion acknowledging that while the three detectors are representative of the dominant paradigms, the results may not cover every possible variant; however, the consistent failure mode across them supports our broader security concern. No additional experiments are planned at this stage due to computational constraints, but we will frame the claims more cautiously.
revision: partial
-
Referee: [§5.2] §5.2 (Transferability Experiments): Transfer success is reported, but without the number of independent runs, standard deviations, or statistical tests on the accuracy drops, the reliability of the black-box transfer claim is difficult to assess and weakens support for the security-vulnerability conclusion.
Authors: We accept this criticism. The original experiments were run with multiple random seeds, but the details were omitted for brevity. In the revision, we will report the exact number of independent runs (five per transfer pair), include standard deviations on the reported accuracy drops, and add paired t-test results to establish statistical significance of the observed transferability. These additions will be placed in §5.2 and the corresponding tables.
revision: yes
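As an illustration of the analysis promised in this response, the sketch below computes a paired t statistic over five runs using only the standard library. The accuracy values are made-up placeholders, not results from the paper, and paired_t_statistic is a hypothetical helper. The critical value 2.776 is the two-sided 5% threshold for 4 degrees of freedom.

```python
import math
import statistics

def paired_t_statistic(before, after):
    """Return (mean difference, t statistic) for paired samples.

    Assumes the differences are not all identical, since a zero standard
    deviation would make the statistic undefined.
    """
    diffs = [b - a for b, a in zip(before, after)]
    mean_diff = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff, mean_diff / std_err

# Placeholder accuracies for five seeds (illustrative only, not measured data).
clean_acc = [0.98, 0.97, 0.99, 0.98, 0.97]
transfer_acc = [0.12, 0.15, 0.10, 0.14, 0.11]

mean_drop, t_stat = paired_t_statistic(clean_acc, transfer_acc)

T_CRIT_DF4 = 2.776  # two-sided critical value at alpha = 0.05 with n - 1 = 4
significant = abs(t_stat) > T_CRIT_DF4
```

Pairing by seed matters here: each clean run is compared with the attacked run that shares its seed, so seed-to-seed variation does not inflate the standard error.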
-
Referee: [§6] §6 (Defense Assessment): The statement that 'standard defense methods against adversarial attacks provide limited mitigation' is load-bearing for the final recommendation to rethink strategies, yet lacks quantitative before/after metrics or ablation on which defenses were tested and why they failed.
Authors: We will revise §6 to address this directly. The updated section will include a table with before-and-after detection accuracies for each tested defense (adversarial training, JPEG compression, and Gaussian smoothing), along with an ablation study showing the effect of defense strength hyperparameters. We will also expand the discussion of why these methods fail, linking it quantitatively to the low-SNR observation already present in the paper. This will make the limited-mitigation claim fully supported by data.
revision: yes
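Of the three defenses named in this response, Gaussian smoothing is the simplest to show as an input-preprocessing step. The sketch below is a NumPy-only separable blur for a single 2D grayscale array. The function names, the default sigma, and the three-sigma kernel radius are assumptions for illustration and say nothing about the configuration the authors tested.

```python
import numpy as np

def gaussian_kernel(sigma):
    """Normalized 1D Gaussian kernel truncated at roughly three standard deviations."""
    radius = max(1, int(3 * sigma + 0.5))
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    return kernel / kernel.sum()

def gaussian_smooth(image, sigma=1.0):
    """Blur a 2D array with a separable Gaussian, keeping the original shape."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    # Replicate border pixels so the output has the same size as the input.
    padded = np.pad(image, radius, mode="edge")
    rows = np.apply_along_axis(
        lambda r: np.convolve(r, kernel, mode="valid"), 1, padded
    )
    return np.apply_along_axis(
        lambda c: np.convolve(c, kernel, mode="valid"), 0, rows
    )

# A defended detector would score gaussian_smooth(image) instead of image.
rng = np.random.default_rng(0)
noisy = rng.normal(size=(32, 32))
smoothed = gaussian_smooth(noisy, sigma=1.0)
```

Sweeping sigma and recording detection accuracy before and after smoothing would give the kind of defense-strength ablation the response describes.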
Circularity Check
No circularity: purely empirical evaluation with independent experimental results
The paper conducts a systematic empirical evaluation of adversarial attacks on three specific reconstruction-based detectors across four generative models. It reports white-box attack success, transferability to black-box settings, and limited effectiveness of standard defenses, attributing failures to low SNR. No derivations, equations, fitted parameters, or self-citations are used to derive the central claims; results follow directly from the described attack constructions and accuracy measurements on held-out data. The representativeness concern raised in the skeptic note is a question of external validity, not a reduction of the reported findings to their own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math: Adversarial perturbations exist that can fool neural classifiers while remaining imperceptible.