arxiv: 2606.20488 · v1 · pith:WCSDDDC3new · submitted 2026-06-18 · 💻 cs.CV

How Fragile Are Training-Free AI-Generated Image Detectors? A Controlled Audit of Score Direction, Preprocessing, and Compression

Pith reviewed 2026-06-26 18:00 UTC · model grok-4.3

classification 💻 cs.CV
keywords training-free detection · AI-generated images · robustness audit · preprocessing sensitivity · hyperparameter effects · JPEG compression · benchmark evaluation · score direction

The pith

Implementation choices swing training-free AI image detector AUROC by up to 0.38 and can reverse score directions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper runs a controlled audit of two training-free detection scores on a shared 1,500-image benchmark covering seven generators and JPEG compression. It shows that swapping the feature backbone or the resize strategy produces AUROC shifts large enough to change which method looks better and even which generators it works on. The direction of one score itself flips with the noise level chosen, and missing re-encoding of compressed images creates false robustness signals. These results indicate that many published differences between detectors trace to inconsistent setups rather than to the methods themselves.

Core claim

Replacing the LPIPS backbone (AlexNet to VGG-16) changes overall AUROC by +0.085. Switching between resize-to-512 and native-resolution preprocessing shifts per-generator AUROC by up to 0.38, enough to flip per-generator conclusions. The RIGID-style score inverts (AUROC < 0.5) on SD1.5 and Wukong at sigma=0.05, recovers above 0.5 for every generator at sigma=0.01, and collapses to 0.15 at sigma=0.3. Without unified re-encoding, AUROC under JPEG-50 exceeds the clean condition for the AlexNet-backbone reconstruction score; after correction, the residual anomaly localizes to BigGAN alone. The two audited scores have complementary per-generator failure sets, yet naive z-score fusion does not improve on the best single score.
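
A minimal sketch of why score direction matters for AUROC, assuming fake is the positive class and using made-up toy scores (not data from the paper). The helper `auroc` is an illustrative pure-Python implementation based on pairwise comparison, not the paper's evaluation code. Negating a score maps an AUROC of a to 1 - a, so a value well below 0.5 indicates a score whose sign convention is inverted rather than one that carries no signal.

```python
def auroc(scores, labels):
    """AUROC by pairwise comparison; label 1 = fake (positive), 0 = real."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    # Count (fake, real) pairs where the fake scores higher; ties count half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy values for illustration only: here fakes tend to score LOWER than reals.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.2, 0.4, 0.1, 0.9, 0.3, 0.8]

a = auroc(scores, labels)                         # inverted direction
a_flipped = auroc([-s for s in scores], labels)   # same score, sign flipped

print(f"as given: {a:.3f}, flipped: {a_flipped:.3f}")
```

The two values always sum to 1, so reporting the direction convention alongside AUROC is what lets a reader tell an inverted score from a weak one.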

What carries the argument

The argument rests on a controlled audit that applies identical preprocessing, hyperparameter sweeps, and re-encoding steps to AEROBLADE-style reconstruction scores and RIGID-style noise-perturbation scores on the GenImage-derived benchmark.

If this is right

  • Backbone choice alone can shift reported detector performance by amounts comparable to claimed method gains.
  • Preprocessing decisions can invert which generator a given score detects reliably.
  • Score direction is controlled by the noise level hyperparameter rather than being fixed by the method.
  • Apparent robustness to JPEG compression can disappear once images are re-encoded under a common protocol.
  • Complementary failure patterns exist across scores, but simple averaging does not exploit them.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Any new training-free detector should be tested across at least two backbones and two resolutions before claims of superiority are made.
  • Published robustness numbers for compression should be treated as provisional until the re-encoding step is verified.
  • The audit protocol itself could be applied to other published training-free scores to check whether their reported rankings hold.
  • Detectors that remain stable across the tested variations would be stronger candidates for deployment than those that flip with small changes.
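
One way to operationalize the stability check suggested above is a small grid harness that scores the same images under every combination of implementation choices and reports the AUROC spread. This is a sketch under stated assumptions: `audit_grid`, `toy_score`, and the setting names `"alexnet"`, `"vgg16"`, `"resize512"`, and `"native"` are hypothetical placeholders, and the "images" are plain floats standing in for real inputs.

```python
from itertools import product


def auroc(scores, labels):
    """AUROC by pairwise comparison; label 1 = fake (positive), 0 = real."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def audit_grid(score_fn, images, labels, grid):
    """Evaluate score_fn under every combination of the settings in grid."""
    names = sorted(grid)
    results = {}
    for combo in product(*(grid[n] for n in names)):
        config = dict(zip(names, combo))
        scores = [score_fn(img, **config) for img in images]
        results[combo] = auroc(scores, labels)
    return names, results


def toy_score(img, backbone, resize):
    # Hypothetical stand-in for a detector: the score flips sign for one
    # backbone/resize combination to mimic a configuration-dependent inversion.
    sign = -1.0 if (backbone == "vgg16" and resize == "native") else 1.0
    return sign * img


images = [0.9, 0.7, 0.6, 0.4, 0.3, 0.1]  # toy inputs, not real images
labels = [1, 1, 1, 0, 0, 0]
grid = {"backbone": ["alexnet", "vgg16"], "resize": ["resize512", "native"]}

names, results = audit_grid(toy_score, images, labels, grid)
spread = max(results.values()) - min(results.values())
inverted = [combo for combo, a in results.items() if a < 0.5]

print(names, results)
print("AUROC spread:", spread, "inverted configs:", inverted)
```

A real harness would replace `toy_score` with the detector under test and add further axes, such as noise level or JPEG quality, to the grid.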

Load-bearing premise

That the 1,500-image GenImage-derived benchmark spanning seven generators is representative enough for the fragility conclusions to apply beyond the tested generators, resolutions, and compression levels.

What would settle it

Re-running the identical audit protocol on a fresh set of generators or at native resolutions outside the tested range and finding AUROC changes below 0.05 for the same backbone and sigma variations.

Original abstract

Training-free detectors of AI-generated images promise generator-agnostic deployment without classifier training, yet their reported numbers are rarely compared under a single controlled protocol. We audit two representative training-free scores -- an autoencoder-reconstruction score (AEROBLADE-style) and a noise-perturbation feature-similarity score (RIGID-style) -- plus a naive feature-kNN control, on a common 1,500-image GenImage-derived benchmark spanning seven generators and JPEG compression at quality 70 and 50. The audit yields three cautionary findings. (i) Implementation details masquerade as method differences: replacing the LPIPS backbone (AlexNet -> VGG-16) changes overall AUROC by +0.085, and switching between resize-to-512 and native-resolution preprocessing flips per-generator conclusions by up to 0.38 AUROC. (ii) Score direction is not a property of the method but of its hyperparameters: the RIGID-style score is inverted (AUROC < 0.5) on SD1.5 and Wukong at noise level sigma=0.05, recovers to >0.5 for every generator at sigma=0.01, and collapses to 0.15 at sigma=0.3. (iii) Dataset format bias inflates robustness claims: without unified re-encoding, AUROC under JPEG-50 exceeds the clean condition for the AlexNet-backbone reconstruction score; after bias correction the residual anomaly localizes to a single generator (BigGAN). The audited scores have complementary per-generator failure sets, but naive z-score fusion does not beat the best single score, indicating that exploiting complementarity requires direction-aware combination.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper conducts a controlled audit of two training-free AI-generated image detection scores (AEROBLADE-style reconstruction and RIGID-style noise-perturbation) plus a feature-kNN control on a shared 1,500-image GenImage-derived benchmark spanning seven generators, explicitly varying LPIPS backbone, resize vs. native preprocessing, noise sigma, and JPEG quality levels. It reports that implementation choices produce AUROC shifts up to 0.085 overall and 0.38 per-generator, that RIGID-style score direction inverts with sigma (e.g., <0.5 at 0.05 on SD1.5/Wukong but >0.5 at 0.01), and that unified re-encoding corrects an apparent robustness inflation under JPEG-50, localizing the residual anomaly to BigGAN; naive z-score fusion does not outperform the best single score.

Significance. If the measured effects hold, the audit is significant for highlighting that apparent performance gaps between training-free detectors can arise from uncontrolled implementation details and hyperparameters rather than core methodological differences, supported by direct AUROC measurements on a fixed benchmark rather than extrapolation. The work earns credit for enforcing identical data and protocols across all comparisons and for correcting one identified dataset bias.

major comments (1)
  1. [Abstract and benchmark description] The cautionary claims that implementation details 'masquerade as method differences' and that scores are fragile rest on the assumption that the 1,500-image GenImage-derived set with seven generators is representative. The paper directly measures the reported deltas within this benchmark (e.g., +0.085 from AlexNet to VGG-16, 0.38 preprocessing flips, sigma-dependent inversion) and corrects the JPEG bias, but it provides no evidence that effects of the same magnitude or direction appear for generators or operating points outside GenImage. That gap is load-bearing for the title-level claim of general fragility.
minor comments (1)
  1. The abstract states that 'naive z-score fusion does not beat the best single score' but does not specify how the z-scores are computed or whether direction is accounted for before fusion; adding one sentence in the experimental protocol would clarify the negative result.
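
To illustrate what the fusion question involves, here is a toy sketch contrasting naive z-score averaging with a direction-aware variant. It is not the paper's procedure, which the abstract does not specify: the scores are made up, and the `signs` argument is an assumption that in practice would have to be estimated, for example from held-out labeled data.

```python
from statistics import mean, pstdev


def auroc(scores, labels):
    """AUROC by pairwise comparison; label 1 = fake (positive), 0 = real."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def zscore(xs):
    """Standardize to zero mean and unit variance (assumes non-constant xs)."""
    mu, sd = mean(xs), pstdev(xs)
    return [(x - mu) / sd for x in xs]


def fuse(score_lists, signs=None):
    """Average z-scored scores; optional signs (+1/-1) reorient each score."""
    signs = signs or [1] * len(score_lists)
    z = [[sg * v for v in zscore(s)] for s, sg in zip(score_lists, signs)]
    return [mean(col) for col in zip(*z)]


# Toy values for illustration only. score_a is higher for fakes;
# score_b is informative but inverted (lower for fakes).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
score_a = [0.9, 0.8, 0.4, 0.7, 0.5, 0.3, 0.2, 0.6]
score_b = [0.1, 0.5, 0.2, 0.3, 0.6, 0.9, 0.4, 0.8]

single_a = auroc(score_a, labels)
single_b = auroc(score_b, labels)
naive_auc = auroc(fuse([score_a, score_b]), labels)
aware_auc = auroc(fuse([score_a, score_b], signs=[1, -1]), labels)

print(single_a, single_b, naive_auc, aware_auc)
```

On this toy data the naive average falls below the better single score because the two directions partly cancel, while the reoriented average does not. Whether the same holds on the audited benchmark depends on how direction is determined per generator.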

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive review and for acknowledging the controlled protocol and bias correction. We address the single major comment below and will revise the manuscript to better scope our claims.

Point-by-point responses
  1. Referee: [Abstract and benchmark description] The cautionary claims that implementation details 'masquerade as method differences' and that scores are fragile rest on the assumption that the 1,500-image GenImage-derived set with seven generators is representative. The paper directly measures the reported deltas within this benchmark (e.g., +0.085 from AlexNet to VGG-16, 0.38 preprocessing flips, sigma-dependent inversion) and corrects the JPEG bias, but it provides no evidence that effects of the same magnitude or direction appear for generators or operating points outside GenImage. That gap is load-bearing for the title-level claim of general fragility.

    Authors: We agree that the empirical results are specific to the GenImage-derived 1,500-image benchmark spanning seven generators and do not constitute evidence for identical effect sizes on other datasets or generators. The manuscript frames the contribution as a controlled audit rather than a universal demonstration; the title poses a question and the abstract reports 'cautionary findings' from this audit. To address the concern, we will (i) revise the abstract to state explicitly that the measured AUROC variations and inversions are observed within this benchmark, (ii) add a dedicated limitations paragraph noting that extrapolation beyond GenImage remains untested, and (iii) adjust phrasing in the introduction and conclusion to avoid implying broader generality. These changes preserve the value of the direct, apples-to-apples comparison on a standard benchmark while removing any overclaim.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical audit on external benchmark

The paper conducts a controlled empirical comparison of existing training-free detector scores (AEROBLADE-style, RIGID-style, and a kNN control) on a fixed 1,500-image GenImage-derived benchmark. All reported AUROC deltas, score inversions, and preprocessing effects are direct measurements against this external test set rather than quantities derived from the paper's own fitted parameters or self-referential definitions. No derivation chain, uniqueness theorem, or ansatz is invoked; the central claims are observational and falsifiable on the stated benchmark. Self-citation load-bearing and fitted-input-called-prediction patterns are absent.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The audit uses only standard AUROC computation and a fixed external benchmark; no new free parameters are fitted to produce the fragility claims.

axioms (1)
  • [standard math] AUROC is an appropriate scalar summary of detector ranking quality across generators.
    Invoked when reporting all performance numbers.
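
For reference, the standard probabilistic reading of AUROC can be written as below. This assumes fake is treated as the positive class (the excerpted protocol text is truncated at that point), and the symbols $s$, $x_f$, and $x_r$ are notation introduced here for the detector score and a randomly drawn fake and real image; they are not taken from the paper.

```latex
\mathrm{AUROC}(s) \;=\; \Pr\big[\, s(x_f) > s(x_r) \,\big] \;+\; \tfrac{1}{2}\,\Pr\big[\, s(x_f) = s(x_r) \,\big]
```

Under this reading AUROC summarizes how well the score ranks fakes above reals across all thresholds, and says nothing about performance at any single operating point.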

Reference graph

Works this paper leans on

38 extracted references · 3 linked inside Pith

  1. [1]

    INTRODUCTION. The rapid progress of generative models -- from GANs [1, 2] to diffusion models [3, 4] and text-conditional systems such as GLIDE, latent diffusion, VQ-diffusion, and SDXL [5, 6, 7, 8] -- has made synthetic imagery a first-order forensic concern. The classical response is training-based: convolutional classifiers trained on real/fake pairs gen...

  2. [2]

    AUDITED METHODS AND PROTOCOL. 2.1. Benchmark. We sample 1,500 images from the validation portion of a GenImage [29] repackaging 1: 800 real images (ImageNet) and 700 fakes, 100 each from seven generators (ADM [4], BigGAN [2], GLIDE [5], Midjourney, SD1.5 [6], VQDM [7], Wukong), drawn with a fixed seed (42). We report threshold-free AUROC with fake as the ...

  3. [3]

    real images are more robust to perturbation

    RESULTS. 3.1. Backbone and preprocessing dominate the headline number. Table 1 shows the original-pipeline matrix. Two observations. First, swapping the LPIPS backbone from AlexNet to VGG-16 -- a one-line change -- moves overall clean AUROC from 0.740 to 0.825. Any leaderboard that mixes implementations is thus comparing backbones as much as methods. Second, t...

  4. [4]

    DISCUSSION. Why are generators so heterogeneous? The per-generator profiles in Tables 2 and 6 split the seven generators into two camps. RIGID-style perturbation sensitivity is strongest on ADM, BigGAN, and VQDM (0.816–0.931 clean) -- pixel-space diffusion, GAN, and vector-quantized models for which pronounced upsampling and spectral artifacts are documented ...

  5. [5]

    compression helps

    AUDIT CONCLUSIONS. C1: Comparisons must control preprocessing and metric implementation. Backbone choice within the same method (+0.085 AUROC) exceeds many reported method-over-method gaps; preprocessing flips per-generator rankings by up to 0.38 AUROC. Papers should report backbone, resolution policy, and crop policy as first-class experimental variables....

  6. [6]

    Generative adversarial nets,

    Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, “Generative adversarial nets,” in Proc. NeurIPS, 2014

  7. [7]

    Large scale GAN training for high fidelity natural image synthesis,

    Andrew Brock, Jeff Donahue, and Karen Simonyan, “Large scale GAN training for high fidelity natural image synthesis,” in Proc. ICLR, 2019

  8. [8]

    Denoising diffusion probabilistic models,

    Jonathan Ho, Ajay Jain, and Pieter Abbeel, “Denoising diffusion probabilistic models,” in Proc. NeurIPS, 2020

  9. [9]

    Diffusion models beat GANs on image synthesis,

    Prafulla Dhariwal and Alexander Nichol, “Diffusion models beat GANs on image synthesis,” in Proc. NeurIPS, 2021

  10. [10]

    GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models,

    Alexander Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen, “GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models,” in Proc. ICML, 2022

  11. [11]

    High-resolution image synthesis with latent diffusion models,

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer, “High-resolution image synthesis with latent diffusion models,” in Proc. IEEE/CVF CVPR, 2022

  12. [12]

    Vector quantized diffusion model for text-to-image synthesis,

    Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo, “Vector quantized diffusion model for text-to-image synthesis,” in Proc. IEEE/CVF CVPR, 2022

  13. [13]

    SDXL: Improving latent diffusion models for high-resolution image synthesis,

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach, “SDXL: Improving latent diffusion models for high-resolution image synthesis,” in Proc. ICLR, 2024

  14. [14]

    CNN-generated images are surprisingly easy to spot... for now,

    Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A. Efros, “CNN-generated images are surprisingly easy to spot... for now,” in Proc. IEEE/CVF CVPR, 2020

  15. [15]

    Are GAN generated images easy to detect? A critical analysis of the state-of-the-art,

    Diego Gragnaniello, Davide Cozzolino, Francesco Marra, Giovanni Poggi, and Luisa Verdoliva, “Are GAN generated images easy to detect? A critical analysis of the state-of-the-art,” in Proc. IEEE ICME, 2021

  16. [16]

    On the detection of synthetic images generated by diffusion models,

    Riccardo Corvi, Davide Cozzolino, Giada Zingarini, Giovanni Poggi, Koki Nagano, and Luisa Verdoliva, “On the detection of synthetic images generated by diffusion models,” in Proc. IEEE ICASSP, 2023

  17. [17]

    Do GANs leave artificial fingerprints?,

    Francesco Marra, Diego Gragnaniello, Luisa Verdoliva, and Giovanni Poggi, “Do GANs leave artificial fingerprints?,” in Proc. IEEE MIPR, 2019

  18. [18]

    Leveraging frequency analysis for deep fake image recognition,

    Joel Frank, Thorsten Eisenhofer, Lea Schönherr, Asja Fischer, Dorothea Kolossa, and Thorsten Holz, “Leveraging frequency analysis for deep fake image recognition,” in Proc. ICML, 2020

  19. [19]

    Intriguing properties of synthetic images: from generative adversarial networks to diffusion models,

    Riccardo Corvi, Davide Cozzolino, Giovanni Poggi, Koki Nagano, and Luisa Verdoliva, “Intriguing properties of synthetic images: from generative adversarial networks to diffusion models,” in Proc. IEEE/CVF CVPR Workshops, 2023

  20. [20]

    Rethinking the up-sampling operations in CNN-based generative network for generalizable deepfake detection,

    Chuangchuang Tan, Huan Liu, Yao Zhao, Shikui Wei, Guanghua Gu, Ping Liu, and Yunchao Wei, “Rethinking the up-sampling operations in CNN-based generative network for generalizable deepfake detection,” in Proc. IEEE/CVF CVPR, 2024

  21. [21]

    Learning transferable visual models from natural language supervision,

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever, “Learning transferable visual models from natural language supervision,” in Proc. ICML, 2021

  22. [22]

    Towards universal fake image detectors that generalize across generative models,

    Utkarsh Ojha, Yuheng Li, and Yong Jae Lee, “Towards universal fake image detectors that generalize across generative models,” in Proc. IEEE/CVF CVPR, 2023

  23. [23]

    A sanity check for AI-generated image detection,

    Shilin Yan, Ouxiang Li, Jiayin Cai, Yanbin Hao, Xiaolong Jiang, Yao Hu, and Weidi Xie, “A sanity check for AI-generated image detection,” in Proc. ICLR, 2025

  24. [24]

    AEROBLADE: Training-free detection of latent diffusion images using autoencoder reconstruction error,

    Jonas Ricker, Denis Lukovnikov, and Asja Fischer, “AEROBLADE: Training-free detection of latent diffusion images using autoencoder reconstruction error,” in Proc. IEEE/CVF CVPR, 2024

  25. [25]

    RIGID: A training-free and model-agnostic framework for robust AI-generated image detection,

    Zhiyuan He, Pin-Yu Chen, and Tsung-Yi Ho, “RIGID: A training-free and model-agnostic framework for robust AI-generated image detection,” arXiv preprint arXiv:2405.20112, 2024

  26. [26]

    DIRE for diffusion-generated image detection,

    Zhendong Wang, Jianmin Bao, Wengang Zhou, Weilun Wang, Hezhen Hu, Hong Chen, and Houqiang Li, “DIRE for diffusion-generated image detection,” in Proc. IEEE/CVF ICCV, 2023

  27. [27]

    HFI: A unified framework for training-free detection and implicit watermarking of latent diffusion model generated images,

    Sungik Choi, Hankook Lee, Jaehoon Lee, Robin Kim, Stanley Jungkyu Choi, and Moontae Lee, “HFI: A unified framework for training-free detection and implicit watermarking of latent diffusion model generated images,” arXiv preprint arXiv:2412.20704, 2024

  28. [28]

    EIRES: Training-free AI-generated image detection via edit-induced reconstruction error shift,

    Wan Jiang, Jing Yan, Xiaojing Chen, Ling Shen, Chenhao Lin, Yunfeng Diao, and Richang Hong, “EIRES: Training-free AI-generated image detection via edit-induced reconstruction error shift,” arXiv preprint arXiv:2510.25141, 2025

  29. [29]

    Zero-shot detection of AI-generated images,

    Davide Cozzolino, Giovanni Poggi, Matthias Nießner, and Luisa Verdoliva, “Zero-shot detection of AI-generated images,” in Proc. ECCV, 2024

  30. [30]

    Understanding and improving training-free AI-generated image detections with vision foundation models,

    Chung-Ting Tsai, Ching-Yun Ko, I-Hsin Chung, Yu-Chiang Frank Wang, and Pin-Yu Chen, “Understanding and improving training-free AI-generated image detections with vision foundation models,” arXiv preprint arXiv:2411.19117, 2024

  31. [31]

    Fake or JPEG? Revealing common biases in generated image detection datasets,

    Patrick Grommelt, Louis Weiss, Franz-Josef Pfreundt, and Janis Keuper, “Fake or JPEG? Revealing common biases in generated image detection datasets,” arXiv preprint arXiv:2403.17608, 2024

  32. [32]

    A bias-free training paradigm for more general AI-generated image detection,

    Fabrizio Guillaro, Giada Zingarini, Ben Usman, Avneesh Sud, Davide Cozzolino, and Luisa Verdoliva, “A bias-free training paradigm for more general AI-generated image detection,” in Proc. IEEE/CVF CVPR, 2025, pp. 18685–18694

  33. [33]

    Intermediate representations are strong AI-generated image detectors,

    Zhenhan Huang, Pin-Yu Chen, Tejaswini Pedapati, and Jianxi Gao, “Intermediate representations are strong AI-generated image detectors,” arXiv preprint arXiv:2605.04358, 2026

  34. [34]

    GenImage: A million-scale benchmark for detecting AI-generated image,

    Mingjian Zhu, Hanting Chen, Qiangyu Yan, Xudong Huang, Guanyu Lin, Wei Li, Zhijun Tu, Hailin Hu, Jie Hu, and Yunhe Wang, “GenImage: A million-scale benchmark for detecting AI-generated image,” in Proc. NeurIPS Datasets and Benchmarks Track, 2023

  35. [35]

    The unreasonable effectiveness of deep features as a perceptual metric,

    Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in Proc. IEEE/CVF CVPR, 2018

  36. [36]

    ImageNet classification with deep convolutional neural networks,

    Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. NeurIPS, 2012

  37. [37]

    Very deep convolutional networks for large-scale image recognition,

    Karen Simonyan and Andrew Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proc. ICLR, 2015

  38. [38]

    DINOv2: Learning robust visual features without supervision,

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, et al., “DINOv2: Learning robust visual features without supervision,” Trans. Mach. Learn. Res., 2024