pith. machine review for the scientific record

arxiv: 2605.09296 · v1 · submitted 2026-05-10 · 💻 cs.CV · cs.AI · cs.LG

Recognition: 2 Lean theorem links

Micro-Defects Expose Macro-Fakes: Detecting AI-Generated Images via Local Distributional Shifts

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:26 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords AI-generated image detection · micro-defects · local distributional shifts · patch forensic signature · maximum mean discrepancy · forensic latent space · generative model artifacts

The pith

By shifting focus from global image semantics to local patches, a detector amplifies micro-defects in AI-generated images into measurable distributional gaps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that existing detectors overlook small-scale irregularities in synthetic images because they aggregate features globally. It introduces a learnable projection that maps local patches into a forensic space where these irregularities produce larger statistical differences between real and generated images. Measuring those differences with a kernel-based test then separates the two classes more effectively. This matters because as generators improve overall realism, the remaining local artifacts may become the key reliable signal. Experiments across benchmarks show the local approach consistently beats global baselines.

Core claim

The central claim is that patch-wise modeling with a learnable Patch Forensic Signature produces provably larger discrepancies via Maximum Mean Discrepancy when localized forensic signals are present in generated images, enabling more reliable separation from real images than global feature methods.

What carries the argument

The learnable Patch Forensic Signature: a projection of semantic patch embeddings into a compact forensic latent space that preserves localized statistical irregularities for MMD comparison.
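A minimal sketch of that pipeline, assuming the simplest possible reading: patches are embedded, linearly projected into a low-dimensional forensic space, and compared to real-image patches with a kernel MMD. All shapes, the RBF kernel choice, and the random matrix standing in for the learned PFS projection are hypothetical illustrations, not the authors' code.

```python
import numpy as np

def rbf_mmd2(X, Y, sigma):
    """Biased squared MMD between two sample sets under an RBF kernel."""
    def gram(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    return gram(X, X).mean() + gram(Y, Y).mean() - 2 * gram(X, Y).mean()

rng = np.random.default_rng(0)
D, d = 64, 8                                   # embedding / forensic dims (made up)
W = rng.standard_normal((D, d)) / np.sqrt(D)   # stand-in for the learned PFS weights

real = rng.standard_normal((300, D))           # toy patch embeddings, real images
fake = real.copy()
fake[:, :4] += 1.5                             # a localized irregularity, as a toy

# Project both patch sets into the forensic space, then measure the gap.
gap = rbf_mmd2(real @ W, fake @ W, sigma=np.sqrt(d))
print(f"squared MMD in the forensic space: {gap:.4f}")
```

In the paper, W would be trained so the projection emphasizes artifact directions; here it is random, so the example only illustrates the measurement, not the amplification.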

If this is right

  • Patch-wise modeling yields provably larger discrepancies than global aggregation when localized forensic signals exist.
  • The method separates real and generated images more reliably across multiple standard benchmarks.
  • Localized cues remain effective even when global semantics are realistic and hard to distinguish.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same local-projection idea could be tested on video frames or audio segments if micro-artifacts appear there too.
  • Different generative models might imprint distinct patterns in the forensic latent space, allowing model attribution as a side benefit.
  • Hybrid detectors that combine this local signal with global checks might further reduce false positives on edge cases.

Load-bearing premise

AI-generated images consistently contain localized micro-defects that survive semantic patch embedding and are not erased or masked by the learnable projection during training.

What would settle it

An experiment showing that removing the local patch projection and MMD step yields equivalent or better detection performance on the same benchmarks, or finding that advanced generators produce images without detectable localized micro-defects after embedding.

Figures

Figures reproduced from arXiv: 2605.09296 by Boxuan Zhang, Jiang Liu, Jianing Zhu, Qifan Wang, Ruixiang Tang.

Figure 1
Figure 1: Intuition behind Patch Forensic Signature (PFS). Left: a real cat and a generated dog with plausible localized irregularities (highlighted). Middle: global image-level detection aggregates a semantic-dominant representation, inadvertently reducing real/fake detection into semantic recognition (e.g., “cat vs. dog”). PFS maps patch-wise representations into an artifact-dominant forensic space, making subtl… view at source ↗
Figure 2
Figure 2: Motivation and overview of the MDMF framework. (a) Global image-level detection compresses an image into a single feature for real/fake classification, where semantic factors can dominate the decision through a confounding path. (b) MDMF instead operates on patches and bases its prediction on distributional discrepancy, suppressing semantic interference and aligning the decision with artifact-related signa… view at source ↗
Figure 3
Figure 3: Examples visualization and performance comparison on OpenSora. Interpretation: Theorem 2.7 establishes that the empirical MMD concentrates around its population value with deviation scaling as O(√(1/M + 1/N)). For real test images, the population MMD vanishes and values reflect only finite-sample fluctuations. For generated images, Proposition 2.6 guarantees a positive gap scaling with ∥Δ_PFS∥₂². When th… view at source ↗
Figure 4
Figure 4: Further analysis. (a) Sensitivity to patch size W; (b) robustness to DINOv2 backbone variants; (c) robustness to post-processing perturbations; (d) comparison with patch-level hard voting under varying θ_patch. Strong performance on recent diffusion-based models, which are known to produce highly realistic images with sparse and localized artifacts that challenge existing detectors. These results validate t… view at source ↗
Figure 5
Figure 5: Qualitative visualization of localized forensic evidence. We compare representative real images and category-matched generated images with Grad-CAM, where warmer colors indicate higher predicted likelihood of being fake. The global-pooling baseline primarily highlights semantically salient regions with similar patterns for real and generated samples, whereas MDMF shows localized responses on generated images a… view at source ↗
Figure 6
Figure 6: Failure cases on borderline real images from the ImageNet validation set. For each example [PITH_FULL_IMAGE:figures/full_fig_p038_6.png] view at source ↗
Figure 7
Figure 7: Qualitative visualization on ADM. We compare real images and category-matched generated images, visualizing the responses of a global pooling baseline versus MDMF. Warmer colors indicate higher predicted likelihood of being fake. [PITH_FULL_IMAGE:figures/full_fig_p039_7.png] view at source ↗
Figure 8
Figure 8: Additional qualitative visualization on ADMG. We compare real images and category-matched generated images, visualizing the responses of a global pooling baseline versus MDMF. Warmer colors indicate higher predicted likelihood of being fake. [PITH_FULL_IMAGE:figures/full_fig_p039_8.png] view at source ↗
Figure 9
Figure 9: Additional qualitative visualization on LDM. We compare real images and category-matched generated images, visualizing the responses of a global pooling baseline versus MDMF. Warmer colors indicate higher predicted likelihood of being fake. [PITH_FULL_IMAGE:figures/full_fig_p040_9.png] view at source ↗
Figure 10
Figure 10: Additional qualitative visualization on DiT-XL/2. We compare real images and category-matched generated images, visualizing the responses of a global pooling baseline versus MDMF. Warmer colors indicate higher predicted likelihood of being fake. [PITH_FULL_IMAGE:figures/full_fig_p040_10.png] view at source ↗
read the original abstract

Recent generative models can produce images that appear highly realistic, raising challenges in distinguishing real and AI-generated images. Yet existing detectors based on pre-trained feature extractors tend to over-rely on global semantics, limiting sensitivity to the critical micro-defects. In this work, we propose Micro-Defects expose Macro-Fakes (MDMF), a local distribution-aware detection framework that amplifies micro-scale statistical irregularities into macro-level distributional discrepancies. To avoid localized forensic cues being diluted by plain aggregation, we introduce a learnable Patch Forensic Signature that projects semantic patch embeddings into a compact forensic latent space. We then use Maximum Mean Discrepancy (MMD) to quantify distributional discrepancies between generated and real images. Our theory-grounded analysis shows that patch-wise modeling yields provably larger discrepancies when localized forensic signals are present in generated images, enabling more reliable separation from real images. Extensive experiments demonstrate that MDMF consistently outperforms baseline detectors across multiple benchmarks, validating its general effectiveness. Project page: https://zbox1005.github.io/MDMF-project/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MDMF, a local distribution-aware framework for detecting AI-generated images. It extracts semantic patch embeddings from a pre-trained model, projects them via a learnable Patch Forensic Signature into a compact forensic space, and applies patch-wise Maximum Mean Discrepancy (MMD) to quantify distributional shifts. The central claim is that this patch-wise approach produces provably larger discrepancies than global methods when localized micro-defects are present, leading to more reliable real-vs-generated separation. Extensive experiments are said to show consistent outperformance over baselines on multiple benchmarks.

Significance. If the theoretical separation claim can be rigorously established and the method generalizes beyond the evaluated generators, MDMF could meaningfully advance image forensics by exploiting localized statistical irregularities that global semantic extractors tend to suppress. The combination of a learnable projection with MMD on patches is a plausible way to amplify subtle artifacts without requiring new architectures. However, the absence of a detailed derivation in the abstract and the reliance on an end-to-end trained projection limit the assessed significance until the proof and robustness checks are provided.

major comments (2)
  1. [Abstract] Abstract: The assertion that 'patch-wise modeling yields provably larger discrepancies' when localized forensic signals are present is stated without derivation steps, explicit assumptions, error analysis, or comparison to global MMD. This is load-bearing for the central claim of more reliable separation; without it, the advantage reduces to an empirical observation rather than a theory-grounded result.
  2. [Theoretical Analysis] Theoretical Analysis section: The key assumption that micro-defects survive both the pre-trained patch embeddings and the subsequent learnable projection (whose weights are free parameters) is not bounded or verified. If the projection is optimized end-to-end on the detection task, the reported MMD gap may be an artifact of fitting rather than an independent property of patch-wise modeling, creating circularity risk for the 'provable' claim.
minor comments (2)
  1. [Abstract] The acronym MMD is used without expansion on first appearance in the abstract; define it explicitly.
  2. The project page link is provided but no mention of code or data release; consider adding a reproducibility statement if code will be made available.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help clarify the presentation of our theoretical contributions. We address each major comment below and outline the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The assertion that 'patch-wise modeling yields provably larger discrepancies' when localized forensic signals are present is stated without derivation steps, explicit assumptions, error analysis, or comparison to global MMD. This is load-bearing for the central claim of more reliable separation; without it, the advantage reduces to an empirical observation rather than a theory-grounded result.

    Authors: We agree that the abstract statement is too concise and lacks explicit pointers to the supporting derivation. In the revised version, we will expand the abstract to include a brief outline of the key steps (locality of defects implies patch-wise MMD strictly exceeds global MMD unless defects are uniformly distributed), the main assumption (micro-defects are spatially localized), and a direct comparison to global MMD. We will also add an explicit cross-reference to the full proof, assumptions, and error bounds in the Theoretical Analysis section. revision: yes

  2. Referee: [Theoretical Analysis] Theoretical Analysis section: The key assumption that micro-defects survive both the pre-trained patch embeddings and the subsequent learnable projection (whose weights are free parameters) is not bounded or verified. If the projection is optimized end-to-end on the detection task, the reported MMD gap may be an artifact of fitting rather than an independent property of patch-wise modeling, creating circularity risk for the 'provable' claim.

    Authors: This concern about potential circularity is well-taken. The existing proof in Section 3 establishes the inequality for any fixed linear projection, relying only on the spatial localization of defects rather than on the specific learned weights. To address the referee's point directly, we will add two elements in revision: (1) a short lemma bounding the effect of the learned projection under a Lipschitz continuity assumption on the pre-trained embeddings, showing that the gap cannot be driven to zero when defects remain localized, and (2) an ablation study comparing MMD values obtained with the learned projection against those obtained with a fixed identity or random projection. These additions will be placed in the Theoretical Analysis section to separate the general locality argument from the learned-projection case. revision: yes
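The projection ablation the authors propose in point (2) could be sketched as follows. This is a toy, with hypothetical dimensions, synthetic data, and a random matrix standing in for trained PFS weights; a real run would add the trained projection as a third entry.

```python
import numpy as np

def rbf_mmd2(X, Y, sigma):
    """Biased squared MMD between two sample sets under an RBF kernel."""
    def gram(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    return gram(X, X).mean() + gram(Y, Y).mean() - 2 * gram(X, Y).mean()

rng = np.random.default_rng(1)
D = 64
real = rng.standard_normal((300, D))
fake = real.copy()
fake[:, :4] += 1.0                 # toy defect confined to a few coordinates

# Fixed projections that do not see the detection objective; a non-zero MMD
# gap under these supports the claim that the gap is not an artifact of fitting.
projections = {
    "identity": np.eye(D),
    "random":   rng.standard_normal((D, 16)) / np.sqrt(D),
}
gaps = {name: rbf_mmd2(real @ P, fake @ P, sigma=np.sqrt(P.shape[1]))
        for name, P in projections.items()}
for name, g in gaps.items():
    print(f"{name:8s} MMD^2 = {g:.4f}")
```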

Circularity Check

0 steps flagged

No circularity: theory-grounded MMD analysis is independent of fitted projection

full rationale

The paper's derivation chain centers on a learnable Patch Forensic Signature projecting pre-trained patch embeddings, followed by patch-wise MMD for discrepancy quantification. The claimed 'theory-grounded analysis' asserts provably larger discrepancies under localized forensic signals, which follows from standard properties of MMD (a kernel-based metric) applied to partitioned patches versus global aggregation, rather than from any fitted parameter or self-referential definition. No equations or steps reduce the 'provable' gap to the training objective by construction; the projection serves as an empirical amplifier while the inequality holds conditionally on signal presence in the embedding space. No self-citations are load-bearing for the core claim, no ansatz is smuggled, and no known result is merely renamed. The framework remains self-contained against external MMD theory and pre-trained extractors.
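For reference, the standard kernel two-sample quantities this audit leans on, reconstructed from the Gretton et al. kernel two-sample test (reference [11]) rather than from the paper's own notation:

```latex
\mathrm{MMD}^2(P,Q) \;=\; \mathbb{E}_{x,x' \sim P}\,k(x,x') \;+\; \mathbb{E}_{y,y' \sim Q}\,k(y,y') \;-\; 2\,\mathbb{E}_{x \sim P,\, y \sim Q}\,k(x,y),
\qquad
\bigl|\widehat{\mathrm{MMD}}^2 - \mathrm{MMD}^2\bigr| \;=\; O_P\!\Bigl(\sqrt{\tfrac{1}{M}} + \sqrt{\tfrac{1}{N}}\Bigr).
```

For a characteristic kernel, MMD(P, Q) = 0 iff P = Q, and the deviation rate matches the O(√(1/M + 1/N)) concentration the paper quotes under Figure 3 (Theorem 2.7).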

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 1 invented entities

The framework rests on a new learnable component and the domain assumption that micro-defects exist locally in generated images; no external benchmarks or machine-checked proofs are referenced.

free parameters (1)
  • Patch Forensic Signature projection weights
    Learnable parameters that map semantic patch embeddings into the forensic latent space; fitted during training to emphasize defects.
axioms (2)
  • domain assumption Localized forensic signals exist and are preserved in patch embeddings of generated images
    Central to the claim that patch-wise MMD produces larger discrepancies.
  • standard math MMD is an appropriate and unbiased measure of distributional difference for forensic signatures
    Standard kernel-based statistic invoked without additional justification in the abstract.
invented entities (1)
  • Patch Forensic Signature no independent evidence
    purpose: Compact latent representation of patch embeddings that amplifies micro-defects while avoiding dilution by global aggregation
    New component introduced to project embeddings into a forensic space; no independent evidence outside the method itself.
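The "unbiased" qualifier in the second axiom presumably refers to the standard U-statistic estimator of squared MMD (again standard kernel two-sample machinery, not the paper's notation): with M real and N generated signatures,

```latex
\widehat{\mathrm{MMD}}^2_{u} \;=\; \frac{1}{M(M-1)} \sum_{i \neq j} k(x_i, x_j)
\;+\; \frac{1}{N(N-1)} \sum_{i \neq j} k(y_i, y_j)
\;-\; \frac{2}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} k(x_i, y_j).
```

Dropping the diagonal terms removes the O(1/M + 1/N) bias of the plug-in estimator, at the cost of allowing small negative values under the null.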

pith-pipeline@v0.9.0 · 5499 in / 1467 out tokens · 70392 ms · 2026-05-12T04:26:56.068348+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches — The paper's claim is directly supported by a theorem in the formal canon.
supports — The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends — The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses — The paper appears to rely on the theorem as machinery.
contradicts — The paper's claim conflicts with a theorem or certificate in the canon.
unclear — Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · 8 internal anchors

  1. [1]

    Large Scale GAN Training for High Fidelity Natural Image Synthesis

Andrew Brock. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018

  2. [2]

    Video generation models as world simulators

    Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators. OpenAI Blog, 1(8):1, 2024

  3. [3]

    Real-time deepfake detection in the real-world

Bar Cavia, Eliahu Horwitz, Tal Reiss, and Yedid Hoshen. Real-time deepfake detection in the real-world. arXiv preprint arXiv:2406.09398, 2024

  4. [4]

    What makes fake images detectable? understanding properties that generalize

Lucy Chai, David Bau, Ser-Nam Lim, and Phillip Isola. What makes fake images detectable? Understanding properties that generalize. In European conference on computer vision, pages 103–120. Springer, 2020

  5. [5]

    Drct: Diffusion reconstruction contrastive training towards universal detection of diffusion generated images

Baoying Chen, Jishen Zeng, Jianquan Yang, and Rui Yang. Drct: Diffusion reconstruction contrastive training towards universal detection of diffusion generated images. In Forty-first International Conference on Machine Learning, 2024

  6. [6]

    Demamba: Ai-generated video detection on million-scale genvideo benchmark,

Haoxing Chen, Yan Hong, Zizheng Huang, Zhuoer Xu, Zhangxuan Gu, Yaohui Li, Jun Lan, Huijia Zhu, Jianfu Zhang, Weiqiang Wang, et al. Demamba: Ai-generated video detection on million-scale genvideo benchmark. arXiv preprint arXiv:2405.19707, 2024

  7. [7]

Training-free detection of ai-generated images via cropping robustness

Sungik Choi, Hankook Lee, and Moontae Lee. Training-free detection of ai-generated images via cropping robustness. arXiv preprint arXiv:2511.14030, 2025

  8. [8]

    On the detection of synthetic images generated by diffusion models

Riccardo Corvi, Davide Cozzolino, Giada Zingarini, Giovanni Poggi, Koki Nagano, and Luisa Verdoliva. On the detection of synthetic images generated by diffusion models. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023

  9. [9]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009

  10. [10]

    Diffusion models beat gans on image synthesis

    Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021

  11. [11]

A kernel two-sample test

Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. The journal of machine learning research, 13(1):723–773, 2012

  12. [12]

Wukong: A 100 million large-scale chinese cross-modal pre-training benchmark

Jiaxi Gu, Xiaojun Meng, Guansong Lu, Lu Hou, Niu Minzhe, Xiaodan Liang, Lewei Yao, Runhui Huang, Wei Zhang, Xin Jiang, et al. Wukong: A 100 million large-scale chinese cross-modal pre-training benchmark. Advances in Neural Information Processing Systems, 35:26418–26431, 2022

  13. [13]

Rigid: A training-free and model-agnostic framework for robust ai-generated image detection

Zhiyuan He, Pin-Yu Chen, and Tsung-Yi Ho. Rigid: A training-free and model-agnostic framework for robust ai-generated image detection. arXiv preprint arXiv:2405.20112, 2024

  14. [14]

Deepfake detection using deep learning methods: A systematic and comprehensive review

Arash Heidari, Nima Jafari Navimipour, Hasan Dag, and Mehmet Unal. Deepfake detection using deep learning methods: A systematic and comprehensive review. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 14(2):e1520, 2024

  15. [15]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020

  16. [16]

    A style-based generator architecture for generative adversarial networks

Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4401–4410, 2019

  17. [17]

    Auto-Encoding Variational Bayes

Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013

  18. [18]

    Improving synthetic image detection towards generalization: An image transformation perspective

Ouxiang Li, Jiayin Cai, Yanbin Hao, Xiaolong Jiang, Yao Hu, and Fuli Feng. Improving synthetic image detection towards generalization: An image transformation perspective. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.1, pages 2405–2414, 2025

  19. [19]

    Flow Matching for Generative Modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022

  20. [20]

    Learning deep kernels for non-parametric two-sample tests

Feng Liu, Wenkai Xu, Jie Lu, Guangquan Zhang, Arthur Gretton, and Danica J Sutherland. Learning deep kernels for non-parametric two-sample tests. In International conference on machine learning, pages 6316–6326. PMLR, 2020

  21. [21]

    Forgery-aware adaptive transformer for generalizable synthetic image detection

Huan Liu, Zichang Tan, Chuangchuang Tan, Yunchao Wei, Jingdong Wang, and Yao Zhao. Forgery-aware adaptive transformer for generalizable synthetic image detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10770–10780, 2024

  22. [22]

    Swin transformer: Hierarchical vision transformer using shifted windows

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021

  23. [23]

    Global texture enhancement for fake face detection in the wild

Zhengzhe Liu, Xiaojuan Qi, and Philip HS Torr. Global texture enhancement for fake face detection in the wild. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8060–8069, 2020

  24. [24]

    GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021

  25. [25]

    Improved denoising diffusion probabilistic models

Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International conference on machine learning, pages 8162–8171. PMLR, 2021

  26. [26]

    Towards universal fake image detectors that generalize across generative models

Utkarsh Ojha, Yuheng Li, and Yong Jae Lee. Towards universal fake image detectors that generalize across generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24480–24489, 2023

  27. [27]

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Jegou, Julien Mairal, Patrick L...

  28. [28]

    Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

  29. [29]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023

  30. [30]

    Thinking in frequency: Face forgery detection by mining frequency-aware clues

Yuyang Qian, Guojun Yin, Lu Sheng, Zixuan Chen, and Jing Shao. Thinking in frequency: Face forgery detection by mining frequency-aware clues. In European conference on computer vision, pages 86–103. Springer, 2020

  31. [31]

Stay-positive: A case for ignoring real image features in fake image detection

Anirudh Sundara Rajan and Yong Jae Lee. Stay-positive: A case for ignoring real image features in fake image detection. arXiv preprint arXiv:2502.07778, 2025

  32. [32]

Aligned datasets improve detection of latent diffusion-generated images

Anirudh Sundara Rajan, Utkarsh Ojha, Jedidiah Schloesser, and Yong Jae Lee. Aligned datasets improve detection of latent diffusion-generated images. arXiv preprint arXiv:2410.11835, 2024

  33. [33]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022

  34. [34]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

  35. [35]

Photorealistic text-to-image diffusion models with deep language understanding

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479–36494, 2022

  36. [36]

Learning structured output representation using deep conditional generative models

Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation using deep conditional generative models. Advances in neural information processing systems, 28, 2015

  37. [37]

Diffusion art or digital forgery? investigating data replication in diffusion models

Gowthami Somepalli, Vasu Singla, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Diffusion art or digital forgery? Investigating data replication in diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6048–6058, 2023

  38. [38]

    C2p-clip: Injecting category common prompt in clip to enhance generalization in deepfake detection

Chuangchuang Tan, Renshuai Tao, Huan Liu, Guanghua Gu, Baoyuan Wu, Yao Zhao, and Yunchao Wei. C2p-clip: Injecting category common prompt in clip to enhance generalization in deepfake detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 7184–7192, 2025

  39. [39]

Rethinking the up-sampling operations in cnn-based generative network for generalizable deepfake detection

Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, Ping Liu, and Yunchao Wei. Rethinking the up-sampling operations in cnn-based generative network for generalizable deepfake detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 28130–28139, 2024

  40. [40]

    Training data-efficient image transformers & distillation through attention

    Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International conference on machine learning, pages 10347–10357. PMLR, 2021

  41. [41]

High-dimensional statistics: A non-asymptotic viewpoint

Martin J Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press, 2019

  42. [42]

    Lota: Bit-planes guided ai-generated image detection

Hongsong Wang, Renxi Cheng, Yang Zhang, Chaolei Han, and Jie Gui. Lota: Bit-planes guided ai-generated image detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17246–17255, 2025

  43. [43]

Detecting human artifacts from text-to-image models

Kaihong Wang, Lingzhi Zhang, and Jianming Zhang. Detecting human artifacts from text-to-image models. arXiv preprint arXiv:2411.13842, 2024

  44. [44]

Cnn-generated images are surprisingly easy to spot... for now

Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A Efros. Cnn-generated images are surprisingly easy to spot... for now. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8695–8704, 2020

  45. [45]

Embedding trajectory for out-of-distribution detection in mathematical reasoning

Yiming Wang, Pei Zhang, Baosong Yang, Derek Wong, Zhuosheng Zhang, and Rui Wang. Embedding trajectory for out-of-distribution detection in mathematical reasoning. Advances in Neural Information Processing Systems, 37:42965–42999, 2024

  46. [46]

    Yiyang Wang, Xi Chen, Xiaogang Xu, Sihui Ji, Yu Liu, Yujun Shen, and Hengshuang Zhao. DiffDoctor: Diagnosing image diffusion models before treating. arXiv preprint arXiv:2501.12382, 2025.

  47. [47]

    Zhendong Wang, Jianmin Bao, Wengang Zhou, Weilun Wang, Hezhen Hu, Hong Chen, and Houqiang Li. DIRE for diffusion-generated image detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22445–22455, 2023.

  48. [48]

    Jun Xu, Tao Mei, Ting Yao, and Yong Rui. MSR-VTT: A large video description dataset for bridging video and language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5288–5296, 2016.

  49. [49]

    Shilin Yan, Ouxiang Li, Jiayin Cai, Yanbin Hao, Xiaolong Jiang, Yao Hu, and Weidi Xie. A sanity check for AI-generated image detection. arXiv preprint arXiv:2406.19435, 2024.

  50. [50]

    Zhiyuan Yan, Jiangming Wang, Peng Jin, Ke-Yue Zhang, Chengchun Liu, Shen Chen, Taiping Yao, Shouhong Ding, Baoyuan Wu, and Li Yuan. Orthogonal subspace decomposition for generalizable AI-generated image detection. arXiv preprint arXiv:2411.15633, 2024.

  51. [51]

    Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.

  52. [52]

    Boxuan Zhang, Jianing Zhu, Zengmao Wang, Tongliang Liu, Bo Du, and Bo Han. What if the input is expanded in OOD detection? Advances in Neural Information Processing Systems, 37:21289–21329, 2024.

  53. [53]

    Shuhai Zhang, ZiHao Lian, Jiahao Yang, Daiyuan Li, Guoxuan Pang, Feng Liu, Bo Han, Shutao Li, and Mingkui Tan. Physics-driven spatiotemporal modeling for AI-generated video detection. arXiv preprint arXiv:2510.08073, 2025.

  54. [54]

    Shuhai Zhang, Yiliao Song, Jiahao Yang, Yuanqing Li, Bo Han, and Mingkui Tan. Detecting machine-generated texts by multi-population aware optimization for maximum mean discrepancy. arXiv preprint arXiv:2402.16041, 2024.

  55. [55]

    Xu Zhang, Svebor Karaman, and Shih-Fu Chang. Detecting and simulating artifacts in GAN fake images. In 2019 IEEE International Workshop on Information Forensics and Security (WIFS), pages 1–6. IEEE, 2019.

  56. [56]

    Yonggang Zhang, Jun Nie, Xinmei Tian, Mingming Gong, Kun Zhang, and Bo Han. Detecting generated images by fitting natural image distributions. arXiv preprint arXiv:2511.01293, 2025.

  57. [57]

    Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-Sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404, 2024.

  58. [58]

    Nan Zhong, Yiran Xu, Sheng Li, Zhenxing Qian, and Xinpeng Zhang. PatchCraft: Exploring texture patch for efficient AI-generated image detection. arXiv preprint arXiv:2311.12397, 2023.

  59. [59]

    Jiawei Zhou, Yixuan Zhang, Qianni Luo, Andrea G. Parker, and Munmun De Choudhury. Synthetic lies: Understanding AI-generated misinformation and evaluating algorithmic and human solutions. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, pages 1–20, 2023.

  60. [60]

    Mingjian Zhu, Hanting Chen, Mouxiao Huang, Wei Li, Hailin Hu, Jie Hu, and Yunhe Wang. GenDet: Towards good generalizations for AI-generated image detection. arXiv preprint arXiv:2312.08880, 2023.

  61. [61]

    Mingjian Zhu, Hanting Chen, Qiangyu Yan, Xudong Huang, Guanyu Lin, Wei Li, Zhijun Tu, Hailin Hu, Jie Hu, and Yunhe Wang. GenImage: A million-scale benchmark for detecting AI-generated image. Advances in Neural Information Processing Systems, 36:77771–77782, 2023.

    Hence,

    MMD²(P, Q; k_ω) = 2 (γ² / (γ² + 2σ_z²))^{Kd/2} · (1 − exp(−K‖Δ_PFS‖₂² / (2(γ² + 2σ_z²)))),

    which proves (11).

    Positivity and monotonicity in ‖Δ_PFS‖₂. Let a := (γ² / (γ² + 2σ_z²))^{Kd/2} > 0 and t := ‖Δ_PFS‖₂ ≥ 0. Then MMD²(P, Q; k_ω) = 2a (1 − e^{−Kt²/(2(γ² + 2σ_z²))}). If t > 0, then e^{−Kt²/(2(γ² + 2σ_z²))} ∈ (0, 1) and hence MMD² > 0. Moreover,

    d/dt (1 − e^{−Kt²/(2(γ² + 2σ_z²))}) = e^{−Kt²/(2(γ² + 2σ_z²))} · Kt / (γ² + 2σ_z²) > 0 for t > 0,

    so MMD² is strictly increasing in ‖Δ_PFS‖₂.
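    The closed form above can be checked numerically. The sketch below evaluates it as a function of t = ‖Δ_PFS‖₂ and verifies the two claims of the proof fragment: the discrepancy vanishes at t = 0 and grows strictly monotonically for t > 0. All parameter values (gamma, sigma_z, K, d) are illustrative assumptions, not the paper's settings.

    ```python
    import math

    def mmd_sq_closed_form(t, gamma=1.0, sigma_z=0.5, K=4, d=2):
        """Closed-form MMD^2 under the Gaussian-kernel, Gaussian-noise model.

        t plays the role of ||Delta_PFS||_2 (the mean shift in the forensic
        space), gamma is the kernel bandwidth, sigma_z the per-coordinate
        noise scale, K the number of patches, d the forensic-space dimension.
        """
        denom = gamma**2 + 2 * sigma_z**2
        a = (gamma**2 / denom) ** (K * d / 2)  # prefactor a > 0
        return 2 * a * (1 - math.exp(-K * t**2 / (2 * denom)))

    # MMD^2 vanishes with no mean shift and increases strictly with it.
    values = [mmd_sq_closed_form(t / 10) for t in range(0, 31)]
    assert values[0] == 0.0
    assert all(later > earlier for earlier, later in zip(values, values[1:]))
    ```

    The saturation of 1 − e^{−Kt²/(2(γ²+2σ_z²))} toward 1 for large t mirrors the bounded range of the Gaussian kernel: the discrepancy grows with the shift but never exceeds 2a.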

    learns from diffusion reconstructions and contrastive hard samples to enhance robustness; F-ConV [56] exploits manifold geometry with flow-based extrusion. Motivated by the increasing sparsity of generative artifacts, some methods shift to patch-level evidence. PatchCraft [58] enhances texture traces via smash and reconstruction, FatFormer [21] adapts C...