pith. machine review for the scientific record

arxiv: 2605.09296 · v1 · submitted 2026-05-10 · 💻 cs.CV · cs.AI · cs.LG

Recognition: 2 Lean theorem links

Micro-Defects Expose Macro-Fakes: Detecting AI-Generated Images via Local Distributional Shifts

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:26 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords AI-generated image detection · micro-defects · local distributional shifts · patch forensic signature · maximum mean discrepancy · forensic latent space · generative model artifacts

The pith

By shifting focus from global image semantics to local patches, a detector amplifies micro-defects in AI-generated images into measurable distributional gaps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that existing detectors overlook small-scale irregularities in synthetic images because they aggregate features globally. It introduces a learnable projection that maps local patches into a forensic space where these irregularities produce larger statistical differences between real and generated images. Measuring those differences with a kernel-based test then separates the two classes more effectively. This matters because as generators improve overall realism, the remaining local artifacts may become the key reliable signal. Experiments across benchmarks show the local approach consistently beats global baselines.

Core claim

The central claim is that patch-wise modeling with a learnable Patch Forensic Signature produces provably larger discrepancies via Maximum Mean Discrepancy when localized forensic signals are present in generated images, enabling more reliable separation from real images than global feature methods.

What carries the argument

The learnable Patch Forensic Signature: a projection of semantic patch embeddings into a compact forensic latent space that preserves localized statistical irregularities for MMD comparison.
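A minimal sketch of that pipeline, assuming the simplest possible reading: patches are embedded, linearly projected into a low-dimensional forensic space, and compared to real-image patches with a kernel MMD. All shapes, the RBF kernel choice, and the random matrix standing in for the learned PFS projection are hypothetical illustrations, not the authors' code.

```python
import numpy as np

def rbf_mmd2(X, Y, sigma):
    """Biased squared MMD between two sample sets under an RBF kernel."""
    def gram(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    return gram(X, X).mean() + gram(Y, Y).mean() - 2 * gram(X, Y).mean()

rng = np.random.default_rng(0)
D, d = 64, 8                                   # embedding / forensic dims (made up)
W = rng.standard_normal((D, d)) / np.sqrt(D)   # stand-in for the learned PFS weights

real = rng.standard_normal((300, D))           # toy patch embeddings, real images
fake = real.copy()
fake[:, :4] += 1.5                             # a localized irregularity, as a toy

# Project both patch sets into the forensic space, then measure the gap.
gap = rbf_mmd2(real @ W, fake @ W, sigma=np.sqrt(d))
print(f"squared MMD in the forensic space: {gap:.4f}")
```

In the paper, W would be trained so the projection emphasizes artifact directions; here it is random, so the example only illustrates the measurement, not the amplification.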

If this is right

  • Patch-wise modeling yields provably larger discrepancies than global aggregation when localized forensic signals exist.
  • The method separates real and generated images more reliably across multiple standard benchmarks.
  • Localized cues remain effective even when global semantics are realistic and hard to distinguish.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same local-projection idea could be tested on video frames or audio segments if micro-artifacts appear there too.
  • Different generative models might imprint distinct patterns in the forensic latent space, allowing model attribution as a side benefit.
  • Hybrid detectors that combine this local signal with global checks might further reduce false positives on edge cases.

Load-bearing premise

AI-generated images consistently contain localized micro-defects that survive semantic patch embedding and are not erased or masked by the learnable projection during training.

What would settle it

An experiment showing that removing the local patch projection and MMD step yields equivalent or better detection performance on the same benchmarks, or finding that advanced generators produce images without detectable localized micro-defects after embedding.

Figures

Figures reproduced from arXiv: 2605.09296 by Boxuan Zhang, Jiang Liu, Jianing Zhu, Qifan Wang, Ruixiang Tang.

Figure 1
Figure 1: Intuition behind Patch Forensic Signature (PFS). Left: a real cat and a generated dog with plausible localized irregularities (highlighted). Middle: global image-level detection aggregates a semantic-dominant representation, inadvertently reducing real/fake detection into semantic recognition (e.g., “cat vs. dog”). PFS maps patch-wise representations into an artifact-dominant forensic space, making subtl… view at source ↗
Figure 2
Figure 2: Motivation and overview of the MDMF framework. (a) Global image-level detection compresses an image into a single feature for real/fake classification, where semantic factors can dominate the decision through a confounding path. (b) MDMF instead operates on patches and bases its prediction on distributional discrepancy, suppressing semantic interference and aligning the decision with artifact-related signa… view at source ↗
Figure 3
Figure 3: Examples visualization and performance comparison on OpenSora. Interpretation: Theorem 2.7 establishes that the empirical MMD concentrates around its population value with deviation scaling as O(√(1/M + 1/N)). For real test images, the population MMD vanishes and values reflect only finite-sample fluctuations. For generated images, Proposition 2.6 guarantees a positive gap scaling with ∥Δ_PFS∥₂². When th… view at source ↗
Figure 4
Figure 4: Further analysis. (a) Sensitivity to patch size W; (b) robustness to DINOv2 backbone variants; (c) robustness to post-processing perturbations; (d) comparison with patch-level hard voting under varying θ_patch. Strong performance on recent diffusion-based models, which are known to produce highly realistic images with sparse and localized artifacts that challenge existing detectors. These results validate t… view at source ↗
Figure 5
Figure 5: Qualitative visualization of localized forensic evidence. We compare representative real images and category-matched generated images with Grad-CAM, where warmer colors indicate higher predicted likelihood of being fake. The global-pooling baseline primarily highlights semantically salient regions with similar patterns for real and generated samples, whereas MDMF shows localized responses on generated images a… view at source ↗
Figure 6
Figure 6: Failure cases on borderline real images from the ImageNet validation set. For each example [PITH_FULL_IMAGE:figures/full_fig_p038_6.png] view at source ↗
Figure 7
Figure 7: Qualitative visualization on ADM. We compare real images and category-matched generated images, visualizing the responses of a global pooling baseline versus MDMF. Warmer colors indicate higher predicted likelihood of being fake. [PITH_FULL_IMAGE:figures/full_fig_p039_7.png] view at source ↗
Figure 8
Figure 8: Additional qualitative visualization on ADMG. We compare real images and category-matched generated images, visualizing the responses of a global pooling baseline versus MDMF. Warmer colors indicate higher predicted likelihood of being fake. [PITH_FULL_IMAGE:figures/full_fig_p039_8.png] view at source ↗
Figure 9
Figure 9: Additional qualitative visualization on LDM. We compare real images and category-matched generated images, visualizing the responses of a global pooling baseline versus MDMF. Warmer colors indicate higher predicted likelihood of being fake. [PITH_FULL_IMAGE:figures/full_fig_p040_9.png] view at source ↗
Figure 10
Figure 10: Additional qualitative visualization on DiT-XL/2. We compare real images and category-matched generated images, visualizing the responses of a global pooling baseline versus MDMF. Warmer colors indicate higher predicted likelihood of being fake. [PITH_FULL_IMAGE:figures/full_fig_p040_10.png] view at source ↗
read the original abstract

Recent generative models can produce images that appear highly realistic, raising challenges in distinguishing real and AI-generated images. Yet existing detectors based on pre-trained feature extractors tend to over-rely on global semantics, limiting sensitivity to the critical micro-defects. In this work, we propose Micro-Defects expose Macro-Fakes (MDMF), a local distribution-aware detection framework that amplifies micro-scale statistical irregularities into macro-level distributional discrepancies. To avoid localized forensic cues being diluted by plain aggregation, we introduce a learnable Patch Forensic Signature that projects semantic patch embeddings into a compact forensic latent space. We then use Maximum Mean Discrepancy (MMD) to quantify distributional discrepancies between generated and real images. Our theory-grounded analysis shows that patch-wise modeling yields provably larger discrepancies when localized forensic signals are present in generated images, enabling more reliable separation from real images. Extensive experiments demonstrate that MDMF consistently outperforms baseline detectors across multiple benchmarks, validating its general effectiveness. Project page: https://zbox1005.github.io/MDMF-project/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MDMF, a local distribution-aware framework for detecting AI-generated images. It extracts semantic patch embeddings from a pre-trained model, projects them via a learnable Patch Forensic Signature into a compact forensic space, and applies patch-wise Maximum Mean Discrepancy (MMD) to quantify distributional shifts. The central claim is that this patch-wise approach produces provably larger discrepancies than global methods when localized micro-defects are present, leading to more reliable real-vs-generated separation. Extensive experiments are said to show consistent outperformance over baselines on multiple benchmarks.

Significance. If the theoretical separation claim can be rigorously established and the method generalizes beyond the evaluated generators, MDMF could meaningfully advance image forensics by exploiting localized statistical irregularities that global semantic extractors tend to suppress. The combination of a learnable projection with MMD on patches is a plausible way to amplify subtle artifacts without requiring new architectures. However, the absence of a detailed derivation in the abstract and the reliance on an end-to-end trained projection limit the assessed significance until the proof and robustness checks are provided.

major comments (2)
  1. [Abstract] Abstract: The assertion that 'patch-wise modeling yields provably larger discrepancies' when localized forensic signals are present is stated without derivation steps, explicit assumptions, error analysis, or comparison to global MMD. This is load-bearing for the central claim of more reliable separation; without it, the advantage reduces to an empirical observation rather than a theory-grounded result.
  2. [Theoretical Analysis] Theoretical Analysis section: The key assumption that micro-defects survive both the pre-trained patch embeddings and the subsequent learnable projection (whose weights are free parameters) is not bounded or verified. If the projection is optimized end-to-end on the detection task, the reported MMD gap may be an artifact of fitting rather than an independent property of patch-wise modeling, creating circularity risk for the 'provable' claim.
minor comments (2)
  1. [Abstract] The acronym MMD is used without expansion on first appearance in the abstract; define it explicitly.
  2. The project page link is provided but no mention of code or data release; consider adding a reproducibility statement if code will be made available.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help clarify the presentation of our theoretical contributions. We address each major comment below and outline the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The assertion that 'patch-wise modeling yields provably larger discrepancies' when localized forensic signals are present is stated without derivation steps, explicit assumptions, error analysis, or comparison to global MMD. This is load-bearing for the central claim of more reliable separation; without it, the advantage reduces to an empirical observation rather than a theory-grounded result.

    Authors: We agree that the abstract statement is too concise and lacks explicit pointers to the supporting derivation. In the revised version, we will expand the abstract to include a brief outline of the key steps (locality of defects implies patch-wise MMD strictly exceeds global MMD unless defects are uniformly distributed), the main assumption (micro-defects are spatially localized), and a direct comparison to global MMD. We will also add an explicit cross-reference to the full proof, assumptions, and error bounds in the Theoretical Analysis section. revision: yes

  2. Referee: [Theoretical Analysis] Theoretical Analysis section: The key assumption that micro-defects survive both the pre-trained patch embeddings and the subsequent learnable projection (whose weights are free parameters) is not bounded or verified. If the projection is optimized end-to-end on the detection task, the reported MMD gap may be an artifact of fitting rather than an independent property of patch-wise modeling, creating circularity risk for the 'provable' claim.

    Authors: This concern about potential circularity is well-taken. The existing proof in Section 3 establishes the inequality for any fixed linear projection, relying only on the spatial localization of defects rather than on the specific learned weights. To address the referee's point directly, we will add two elements in revision: (1) a short lemma bounding the effect of the learned projection under a Lipschitz continuity assumption on the pre-trained embeddings, showing that the gap cannot be driven to zero when defects remain localized, and (2) an ablation study comparing MMD values obtained with the learned projection against those obtained with a fixed identity or random projection. These additions will be placed in the Theoretical Analysis section to separate the general locality argument from the learned-projection case. revision: yes
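The projection ablation the authors propose in point (2) could be sketched as follows. This is a toy, with hypothetical dimensions, synthetic data, and a random matrix standing in for trained PFS weights; a real run would add the trained projection as a third entry.

```python
import numpy as np

def rbf_mmd2(X, Y, sigma):
    """Biased squared MMD between two sample sets under an RBF kernel."""
    def gram(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    return gram(X, X).mean() + gram(Y, Y).mean() - 2 * gram(X, Y).mean()

rng = np.random.default_rng(1)
D = 64
real = rng.standard_normal((300, D))
fake = real.copy()
fake[:, :4] += 1.0                 # toy defect confined to a few coordinates

# Fixed projections that do not see the detection objective; a non-zero MMD
# gap under these supports the claim that the gap is not an artifact of fitting.
projections = {
    "identity": np.eye(D),
    "random":   rng.standard_normal((D, 16)) / np.sqrt(D),
}
gaps = {name: rbf_mmd2(real @ P, fake @ P, sigma=np.sqrt(P.shape[1]))
        for name, P in projections.items()}
for name, g in gaps.items():
    print(f"{name:8s} MMD^2 = {g:.4f}")
```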

Circularity Check

0 steps flagged

No circularity: theory-grounded MMD analysis is independent of fitted projection

full rationale

The paper's derivation chain centers on a learnable Patch Forensic Signature projecting pre-trained patch embeddings, followed by patch-wise MMD for discrepancy quantification. The claimed 'theory-grounded analysis' asserts provably larger discrepancies under localized forensic signals, which follows from standard properties of MMD (a kernel-based metric) applied to partitioned patches versus global aggregation, rather than from any fitted parameter or self-referential definition. No equations or steps reduce the 'provable' gap to the training objective by construction; the projection serves as an empirical amplifier while the inequality holds conditionally on signal presence in the embedding space. No self-citations are load-bearing for the core claim, no ansatz is smuggled, and no known result is merely renamed. The framework remains self-contained against external MMD theory and pre-trained extractors.
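For reference, the standard kernel two-sample quantities this audit leans on, reconstructed from the Gretton et al. kernel two-sample test (reference [11]) rather than from the paper's own notation:

```latex
\mathrm{MMD}^2(P,Q) \;=\; \mathbb{E}_{x,x' \sim P}\,k(x,x') \;+\; \mathbb{E}_{y,y' \sim Q}\,k(y,y') \;-\; 2\,\mathbb{E}_{x \sim P,\, y \sim Q}\,k(x,y),
\qquad
\bigl|\widehat{\mathrm{MMD}}^2 - \mathrm{MMD}^2\bigr| \;=\; O_P\!\Bigl(\sqrt{\tfrac{1}{M}} + \sqrt{\tfrac{1}{N}}\Bigr).
```

For a characteristic kernel, MMD(P, Q) = 0 iff P = Q, and the deviation rate matches the O(√(1/M + 1/N)) concentration the paper quotes under Figure 3 (Theorem 2.7).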

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 1 invented entities

The framework rests on a new learnable component and the domain assumption that micro-defects exist locally in generated images; no external benchmarks or machine-checked proofs are referenced.

free parameters (1)
  • Patch Forensic Signature projection weights
    Learnable parameters that map semantic patch embeddings into the forensic latent space; fitted during training to emphasize defects.
axioms (2)
  • domain assumption Localized forensic signals exist and are preserved in patch embeddings of generated images
    Central to the claim that patch-wise MMD produces larger discrepancies.
  • standard math MMD is an appropriate and unbiased measure of distributional difference for forensic signatures
    Standard kernel-based statistic invoked without additional justification in the abstract.
invented entities (1)
  • Patch Forensic Signature no independent evidence
    purpose: Compact latent representation of patch embeddings that amplifies micro-defects while avoiding dilution by global aggregation
    New component introduced to project embeddings into a forensic space; no independent evidence outside the method itself.
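The "unbiased" qualifier in the second axiom presumably refers to the standard U-statistic estimator of squared MMD (again standard kernel two-sample machinery, not the paper's notation): with M real and N generated signatures,

```latex
\widehat{\mathrm{MMD}}^2_{u} \;=\; \frac{1}{M(M-1)} \sum_{i \neq j} k(x_i, x_j)
\;+\; \frac{1}{N(N-1)} \sum_{i \neq j} k(y_i, y_j)
\;-\; \frac{2}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} k(x_i, y_j).
```

Dropping the diagonal terms removes the O(1/M + 1/N) bias of the plug-in estimator, at the cost of allowing small negative values under the null.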

pith-pipeline@v0.9.0 · 5499 in / 1467 out tokens · 70392 ms · 2026-05-12T04:26:56.068348+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches — The paper's claim is directly supported by a theorem in the formal canon.
supports — The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends — The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses — The paper appears to rely on the theorem as machinery.
contradicts — The paper's claim conflicts with a theorem or certificate in the canon.
unclear — Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · 8 internal anchors

  1. [1]

    Large Scale GAN Training for High Fidelity Natural Image Synthesis

Andrew Brock. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018

  2. [2]

    Video generation models as world simulators

    Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators. OpenAI Blog, 1(8):1, 2024

  3. [3]

    Real-time deepfake detection in the real-world

Bar Cavia, Eliahu Horwitz, Tal Reiss, and Yedid Hoshen. Real-time deepfake detection in the real-world. arXiv preprint arXiv:2406.09398, 2024

  4. [4]

    What makes fake images detectable? understanding properties that generalize

Lucy Chai, David Bau, Ser-Nam Lim, and Phillip Isola. What makes fake images detectable? Understanding properties that generalize. In European conference on computer vision, pages 103–120. Springer, 2020

  5. [5]

    Drct: Diffusion reconstruction contrastive training towards universal detection of diffusion generated images

Baoying Chen, Jishen Zeng, Jianquan Yang, and Rui Yang. Drct: Diffusion reconstruction contrastive training towards universal detection of diffusion generated images. In Forty-first International Conference on Machine Learning, 2024

  6. [6]

    Demamba: Ai-generated video detection on million-scale genvideo benchmark,

Haoxing Chen, Yan Hong, Zizheng Huang, Zhuoer Xu, Zhangxuan Gu, Yaohui Li, Jun Lan, Huijia Zhu, Jianfu Zhang, Weiqiang Wang, et al. Demamba: Ai-generated video detection on million-scale genvideo benchmark. arXiv preprint arXiv:2405.19707, 2024

  7. [7]

Training-free detection of ai-generated images via cropping robustness

Sungik Choi, Hankook Lee, and Moontae Lee. Training-free detection of ai-generated images via cropping robustness. arXiv preprint arXiv:2511.14030, 2025

  8. [8]

    On the detection of synthetic images generated by diffusion models

Riccardo Corvi, Davide Cozzolino, Giada Zingarini, Giovanni Poggi, Koki Nagano, and Luisa Verdoliva. On the detection of synthetic images generated by diffusion models. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023

  9. [9]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009

  10. [10]

    Diffusion models beat gans on image synthesis

    Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021

  11. [11]

A kernel two-sample test

Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. The journal of machine learning research, 13(1):723–773, 2012

  12. [12]

Wukong: A 100 million large-scale chinese cross-modal pre-training benchmark

Jiaxi Gu, Xiaojun Meng, Guansong Lu, Lu Hou, Niu Minzhe, Xiaodan Liang, Lewei Yao, Runhui Huang, Wei Zhang, Xin Jiang, et al. Wukong: A 100 million large-scale chinese cross-modal pre-training benchmark. Advances in Neural Information Processing Systems, 35:26418–26431, 2022

  13. [13]

Rigid: A training-free and model-agnostic framework for robust ai-generated image detection

Zhiyuan He, Pin-Yu Chen, and Tsung-Yi Ho. Rigid: A training-free and model-agnostic framework for robust ai-generated image detection. arXiv preprint arXiv:2405.20112, 2024

  14. [14]

Deepfake detection using deep learning methods: A systematic and comprehensive review

Arash Heidari, Nima Jafari Navimipour, Hasan Dag, and Mehmet Unal. Deepfake detection using deep learning methods: A systematic and comprehensive review. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 14(2):e1520, 2024

  15. [15]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020

  16. [16]

    A style-based generator architecture for generative adversarial networks

Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4401–4410, 2019

  17. [17]

    Auto-Encoding Variational Bayes

Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013

  18. [18]

    Improving synthetic image detection towards generalization: An image transformation perspective

Ouxiang Li, Jiayin Cai, Yanbin Hao, Xiaolong Jiang, Yao Hu, and Fuli Feng. Improving synthetic image detection towards generalization: An image transformation perspective. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.1, pages 2405–2414, 2025

  19. [19]

    Flow Matching for Generative Modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022

  20. [20]

    Learning deep kernels for non-parametric two-sample tests

Feng Liu, Wenkai Xu, Jie Lu, Guangquan Zhang, Arthur Gretton, and Danica J Sutherland. Learning deep kernels for non-parametric two-sample tests. In International conference on machine learning, pages 6316–6326. PMLR, 2020

  21. [21]

    Forgery-aware adaptive transformer for generalizable synthetic image detection

Huan Liu, Zichang Tan, Chuangchuang Tan, Yunchao Wei, Jingdong Wang, and Yao Zhao. Forgery-aware adaptive transformer for generalizable synthetic image detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10770–10780, 2024

  22. [22]

    Swin transformer: Hierarchical vision transformer using shifted windows

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021

  23. [23]

    Global texture enhancement for fake face detection in the wild

Zhengzhe Liu, Xiaojuan Qi, and Philip HS Torr. Global texture enhancement for fake face detection in the wild. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8060–8069, 2020

  24. [24]

    GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021

  25. [25]

    Improved denoising diffusion probabilistic models

Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International conference on machine learning, pages 8162–8171. PMLR, 2021

  26. [26]

    Towards universal fake image detectors that generalize across generative models

Utkarsh Ojha, Yuheng Li, and Yong Jae Lee. Towards universal fake image detectors that generalize across generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24480–24489, 2023

  27. [27]

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Jegou, Julien Mairal, Patrick L...

  28. [28]

    Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

  29. [29]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023

  30. [30]

    Thinking in frequency: Face forgery detection by mining frequency-aware clues

Yuyang Qian, Guojun Yin, Lu Sheng, Zixuan Chen, and Jing Shao. Thinking in frequency: Face forgery detection by mining frequency-aware clues. In European conference on computer vision, pages 86–103. Springer, 2020

  31. [31]

Stay-positive: A case for ignoring real image features in fake image detection

Anirudh Sundara Rajan and Yong Jae Lee. Stay-positive: A case for ignoring real image features in fake image detection. arXiv preprint arXiv:2502.07778, 2025

  32. [32]

Aligned datasets improve detection of latent diffusion-generated images

Anirudh Sundara Rajan, Utkarsh Ojha, Jedidiah Schloesser, and Yong Jae Lee. Aligned datasets improve detection of latent diffusion-generated images. arXiv preprint arXiv:2410.11835, 2024

  33. [33]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022

  34. [34]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

  35. [35]

Photorealistic text-to-image diffusion models with deep language understanding

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479–36494, 2022

  36. [36]

Learning structured output representation using deep conditional generative models

Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation using deep conditional generative models. Advances in neural information processing systems, 28, 2015

  37. [37]

Diffusion art or digital forgery? investigating data replication in diffusion models

Gowthami Somepalli, Vasu Singla, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Diffusion art or digital forgery? Investigating data replication in diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6048–6058, 2023

  38. [38]

    C2p-clip: Injecting category common prompt in clip to enhance generalization in deepfake detection

Chuangchuang Tan, Renshuai Tao, Huan Liu, Guanghua Gu, Baoyuan Wu, Yao Zhao, and Yunchao Wei. C2p-clip: Injecting category common prompt in clip to enhance generalization in deepfake detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 7184–7192, 2025

  39. [39]

Rethinking the up-sampling operations in cnn-based generative network for generalizable deepfake detection

Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, Ping Liu, and Yunchao Wei. Rethinking the up-sampling operations in cnn-based generative network for generalizable deepfake detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 28130–28139, 2024

  40. [40]

    Training data-efficient image transformers & distillation through attention

    Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International conference on machine learning, pages 10347–10357. PMLR, 2021

  41. [41]

High-dimensional statistics: A non-asymptotic viewpoint

Martin J Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press, 2019

  42. [42]

    Lota: Bit-planes guided ai-generated image detection

Hongsong Wang, Renxi Cheng, Yang Zhang, Chaolei Han, and Jie Gui. Lota: Bit-planes guided ai-generated image detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17246–17255, 2025

  43. [43]

Detecting human artifacts from text-to-image models

Kaihong Wang, Lingzhi Zhang, and Jianming Zhang. Detecting human artifacts from text-to-image models. arXiv preprint arXiv:2411.13842, 2024

  44. [44]

Cnn-generated images are surprisingly easy to spot... for now

Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A Efros. Cnn-generated images are surprisingly easy to spot... for now. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8695–8704, 2020

  45. [45]

Embedding trajectory for out-of-distribution detection in mathematical reasoning

Yiming Wang, Pei Zhang, Baosong Yang, Derek Wong, Zhuosheng Zhang, and Rui Wang. Embedding trajectory for out-of-distribution detection in mathematical reasoning. Advances in Neural Information Processing Systems, 37:42965–42999, 2024

  46. [46]

    Yiyang Wang, Xi Chen, Xiaogang Xu, Sihui Ji, Yu Liu, Yujun Shen, and Hengshuang Zhao. DiffDoctor: Diagnosing image diffusion models before treating. arXiv preprint arXiv:2501.12382, 2025.

  47. [47]

    Zhendong Wang, Jianmin Bao, Wengang Zhou, Weilun Wang, Hezhen Hu, Hong Chen, and Houqiang Li. DIRE for diffusion-generated image detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22445–22455, 2023.

  48. [48]

    Jun Xu, Tao Mei, Ting Yao, and Yong Rui. MSR-VTT: A large video description dataset for bridging video and language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5288–5296, 2016.

  49. [49]

    Shilin Yan, Ouxiang Li, Jiayin Cai, Yanbin Hao, Xiaolong Jiang, Yao Hu, and Weidi Xie. A sanity check for AI-generated image detection. arXiv preprint arXiv:2406.19435, 2024.

  50. [50]

    Zhiyuan Yan, Jiangming Wang, Peng Jin, Ke-Yue Zhang, Chengchun Liu, Shen Chen, Taiping Yao, Shouhong Ding, Baoyuan Wu, and Li Yuan. Orthogonal subspace decomposition for generalizable AI-generated image detection. arXiv preprint arXiv:2411.15633, 2024.

  51. [51]

    Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.

  52. [52]

    Boxuan Zhang, Jianing Zhu, Zengmao Wang, Tongliang Liu, Bo Du, and Bo Han. What if the input is expanded in OOD detection? Advances in Neural Information Processing Systems, 37:21289–21329, 2024.

  53. [53]

    Shuhai Zhang, ZiHao Lian, Jiahao Yang, Daiyuan Li, Guoxuan Pang, Feng Liu, Bo Han, Shutao Li, and Mingkui Tan. Physics-driven spatiotemporal modeling for AI-generated video detection. arXiv preprint arXiv:2510.08073, 2025.

  54. [54]

    Shuhai Zhang, Yiliao Song, Jiahao Yang, Yuanqing Li, Bo Han, and Mingkui Tan. Detecting machine-generated texts by multi-population aware optimization for maximum mean discrepancy. arXiv preprint arXiv:2402.16041, 2024.

  55. [55]

    Xu Zhang, Svebor Karaman, and Shih-Fu Chang. Detecting and simulating artifacts in GAN fake images. In 2019 IEEE International Workshop on Information Forensics and Security (WIFS), pages 1–6. IEEE, 2019.

  56. [56]

    Yonggang Zhang, Jun Nie, Xinmei Tian, Mingming Gong, Kun Zhang, and Bo Han. Detecting generated images by fitting natural image distributions. arXiv preprint arXiv:2511.01293, 2025.

  57. [57]

    Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-Sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404, 2024.

  58. [58]

    Nan Zhong, Yiran Xu, Sheng Li, Zhenxing Qian, and Xinpeng Zhang. PatchCraft: Exploring texture patch for efficient AI-generated image detection. arXiv preprint arXiv:2311.12397, 2023.

  59. [59]

    Jiawei Zhou, Yixuan Zhang, Qianni Luo, Andrea G. Parker, and Munmun De Choudhury. Synthetic lies: Understanding AI-generated misinformation and evaluating algorithmic and human solutions. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, pages 1–20, 2023.

  60. [60]

    Mingjian Zhu, Hanting Chen, Mouxiao Huang, Wei Li, Hailin Hu, Jie Hu, and Yunhe Wang. GenDet: Towards good generalizations for AI-generated image detection. arXiv preprint arXiv:2312.08880, 2023.

  61. [61]

    Mingjian Zhu, Hanting Chen, Qiangyu Yan, Xudong Huang, Guanyu Lin, Wei Li, Zhijun Tu, Hailin Hu, Jie Hu, and Yunhe Wang. GenImage: A million-scale benchmark for detecting AI-generated image. Advances in Neural Information Processing Systems, 36:77771–77782, 2023.

    Hence,

    MMD²(P, Q; k_ω) = 2 (γ² / (γ² + 2σ_z²))^{Kd/2} · (1 − exp(−K‖Δ_PFS‖₂² / (2(γ² + 2σ_z²)))),

    which proves (11).

    Positivity and monotonicity in ‖Δ_PFS‖₂. Let a := (γ² / (γ² + 2σ_z²))^{Kd/2} > 0 and t := ‖Δ_PFS‖₂ ≥ 0. Then MMD²(P, Q; k_ω) = 2a (1 − e^{−Kt²/(2(γ² + 2σ_z²))}). If t > 0, then e^{−Kt²/(2(γ² + 2σ_z²))} ∈ (0, 1) and hence MMD² > 0. Moreover,

    d/dt (1 − e^{−Kt²/(2(γ² + 2σ_z²))}) = e^{−Kt²/(2(γ² + 2σ_z²))} · Kt / (γ² + 2σ_z²) > 0 for t > 0,

    so MMD² is strictly increasing in ‖Δ_PFS‖₂.
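    The closed form above can be checked numerically. The sketch below evaluates it as a function of t = ‖Δ_PFS‖₂ and verifies the two claims of the proof fragment: the discrepancy vanishes at t = 0 and grows strictly monotonically for t > 0. All parameter values (gamma, sigma_z, K, d) are illustrative assumptions, not the paper's settings.

    ```python
    import math

    def mmd_sq_closed_form(t, gamma=1.0, sigma_z=0.5, K=4, d=2):
        """Closed-form MMD^2 under the Gaussian-kernel, Gaussian-noise model.

        t plays the role of ||Delta_PFS||_2 (the mean shift in the forensic
        space), gamma is the kernel bandwidth, sigma_z the per-coordinate
        noise scale, K the number of patches, d the forensic-space dimension.
        """
        denom = gamma**2 + 2 * sigma_z**2
        a = (gamma**2 / denom) ** (K * d / 2)  # prefactor a > 0
        return 2 * a * (1 - math.exp(-K * t**2 / (2 * denom)))

    # MMD^2 vanishes with no mean shift and increases strictly with it.
    values = [mmd_sq_closed_form(t / 10) for t in range(0, 31)]
    assert values[0] == 0.0
    assert all(later > earlier for earlier, later in zip(values, values[1:]))
    ```

    The saturation of 1 − e^{−Kt²/(2(γ²+2σ_z²))} toward 1 for large t mirrors the bounded range of the Gaussian kernel: the discrepancy grows with the shift but never exceeds 2a.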

    learns from diffusion reconstructions and contrastive hard samples to enhance robustness; F-ConV [56] exploits manifold geometry with flow-based extrusion. Motivated by the increasing sparsity of generative artifacts, some methods shift to patch-level evidence. PatchCraft [58] enhances texture traces via smash and reconstruction, FatFormer [21] adapts C...