Can Visual Mamba Improve AI-Generated Image Detection? An In-Depth Investigation

Abdelmalik Taleb-Ahmed; Abdenour Hadid; Hessen Bougueffa Eutamene; Mamadou Keita; Wassim Hamidouche; Xianxun Zhu

arXiv: 2605.14799 · v2 · pith:UVNLUZFNnew · submitted 2026-05-14 · cs.CV · cs.CR · cs.SI

Mamadou Keita, Wassim Hamidouche, Hessen Bougueffa Eutamene, Abdelmalik Taleb-Ahmed, Xianxun Zhu, Abdenour Hadid

Pith reviewed 2026-06-30 21:15 UTC · model grok-4.3

classification cs.CV · cs.CR · cs.SI

keywords AI-generated image detection · Vision Mamba · image forensics · synthetic image classification · deep learning detectors · computer vision backbones · generative model identification

The pith

Vision Mamba models exhibit competitive efficiency yet lower accuracy and weaker generalization than CNNs, ViTs, and VLMs when detecting AI-generated images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper conducts a head-to-head benchmark of multiple Vision Mamba variants against CNN, Vision Transformer, and vision-language model detectors on several public datasets containing both real photographs and images produced by GANs and diffusion models. It measures accuracy, inference speed, and how performance holds when the test images come from generators or visual domains not seen during training. A reader would care because scalable, low-cost detectors are needed to flag synthetic content that can spread misinformation or enable fraud. The analysis concludes that Mamba backbones offer a speed advantage but fall short on the core classification task under current training regimes.

Core claim

Vision Mamba architectures, when adapted for binary real-versus-synthetic classification, achieve inference speeds that surpass most transformer baselines while delivering accuracy that remains below the best CNN and VLM detectors; the gap widens on out-of-distribution generators, showing that state-space visual models can contribute to detection pipelines but require additional adaptation to match established methods in reliability.

What carries the argument

Vision Mamba, a selective state-space model backbone for image classification, evaluated here as a drop-in feature extractor for distinguishing authentic from AI-generated images.
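
For orientation, the state-space formulation usually given for Mamba-style models can be sketched as follows. This is the standard notation from the SSM literature (for example the Mamba paper cited as [21]), not something reproduced from this page; the functions f_B, f_C and f_Δ are placeholder names for learned input-dependent projections.

```latex
\begin{align*}
% Continuous-time linear state-space model with time-invariant A, B, C
h'(t) &= A\,h(t) + B\,x(t), & y(t) &= C\,h(t) \\
% Zero-order-hold discretization with step size \Delta
\bar{A} &= \exp(\Delta A), &
\bar{B} &= (\Delta A)^{-1}\bigl(\exp(\Delta A) - I\bigr)\,\Delta B \\
% Discrete recurrence applied to a token sequence x_1, x_2, \dots
h_t &= \bar{A}\,h_{t-1} + \bar{B}\,x_t, & y_t &= C\,h_t \\
% Selective variant: parameters become functions of the current input
B_t &= f_B(x_t), \quad C_t = f_C(x_t), & \Delta_t &= f_\Delta(x_t)
\end{align*}
```

In the selective variant the recurrence is no longer time-invariant, which is what lets the model decide per token what to keep in the state. Visual Mamba variants differ mainly in how they order image patches into the sequence x_1, x_2, … before applying this recurrence.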

If this is right

Mamba-based detectors can reduce computational cost in large-scale screening systems that must process millions of images daily.
The observed accuracy shortfall implies that pure Mamba pipelines may need supplementary modules such as frequency-domain filters or ensemble heads to reach deployment thresholds.
Cross-generator evaluation shows that training on a narrow set of synthetic sources produces brittle detectors, regardless of backbone architecture.
Efficiency gains position Vision Mamba as a candidate for on-device or edge-based detection where latency matters more than marginal accuracy.
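
The efficiency claims above rest on measured inference speed. A minimal sketch of how per-image latency and throughput could be timed is given below; `measure_latency` and the stand-in detector are hypothetical names for illustration, and the paper's own measurement protocol is not described on this page.

```python
import time
from typing import Callable, Sequence


def measure_latency(predict: Callable[[object], int],
                    images: Sequence[object],
                    warmup: int = 2) -> dict:
    """Time a detector callable over a list of inputs (hypothetical helper)."""
    if not images:
        raise ValueError("need at least one image to time")

    # Warm-up calls are excluded from timing (caches, lazy initialisation).
    for image in images[:warmup]:
        predict(image)

    start = time.perf_counter()
    for image in images:
        predict(image)
    elapsed = time.perf_counter() - start

    return {
        "images": len(images),
        "seconds_per_image": elapsed / len(images),
        "images_per_second": len(images) / elapsed if elapsed > 0 else float("inf"),
    }


if __name__ == "__main__":
    # Stand-in detector: flags an "image" (here a flat list of pixel values)
    # as fake (1) when its mean exceeds 0.5. Purely illustrative.
    def dummy_detector(image):
        return int(sum(image) / len(image) > 0.5)

    batch = [[0.1 * (i % 10)] * 64 for i in range(1000)]
    print(measure_latency(dummy_detector, batch))
```

For a real backbone the same loop would wrap the model's forward pass. On a GPU the device would also need to be synchronized before reading the clock, which this sketch does not cover.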

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Hybrid architectures that replace only the attention layers of a ViT with Mamba blocks could combine the strengths of both without full retraining.
The speed advantage may prove decisive in video or live-stream settings where frame-by-frame detection is required.
Transfer from Mamba models pretrained on medical or satellite imagery could supply better initial features for the detection task than ImageNet weights alone.

Load-bearing premise

The chosen datasets, generative models, and evaluation metrics sufficiently represent real-world conditions and capture generalizability for AI-generated image detection.

What would settle it

Retraining and testing the same Mamba variants on a fresh dataset of images from a diffusion model released after the paper's experiments, using the identical train-test split protocol, would show whether the reported accuracy gap persists.
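
The check described above amounts to comparing accuracy on generators seen in training with accuracy on a newer one. A minimal sketch of that bookkeeping follows; the record layout, the function names, and the generator labels `gen_seen` and `gen_new` are assumptions made for illustration, and the numbers are toy values rather than results from the paper.

```python
from collections import defaultdict


def accuracy_by_generator(records):
    """Per-source accuracy.

    records: iterable of (source_name, true_label, predicted_label),
    with labels 0 = real and 1 = fake.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for source, truth, predicted in records:
        total[source] += 1
        correct[source] += int(truth == predicted)
    return {name: correct[name] / total[name] for name in total}


def generalization_gap(per_generator, seen, unseen):
    """Mean accuracy on seen sources minus mean accuracy on unseen ones."""
    seen_acc = sum(per_generator[g] for g in seen) / len(seen)
    unseen_acc = sum(per_generator[g] for g in unseen) / len(unseen)
    return seen_acc - unseen_acc


if __name__ == "__main__":
    # Toy predictions: 3 of 4 correct on the seen source, 2 of 4 on the new one.
    toy = [
        ("gen_seen", 1, 1), ("gen_seen", 1, 1), ("gen_seen", 1, 1), ("gen_seen", 1, 0),
        ("gen_new", 1, 1), ("gen_new", 1, 1), ("gen_new", 1, 0), ("gen_new", 1, 0),
    ]
    per_gen = accuracy_by_generator(toy)
    print(per_gen)
    print(generalization_gap(per_gen, seen=["gen_seen"], unseen=["gen_new"]))
```

A gap that stays large after retraining on the fresh data would support the reported shortfall; a gap near zero would suggest it came from the original training set rather than the backbone.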

Figures

Figures reproduced from arXiv: 2605.14799 by Abdelmalik Taleb-Ahmed, Abdenour Hadid, Hessen Bougueffa Eutamene, Mamadou Keita, Wassim Hamidouche, Xianxun Zhu.

**Figure 1.** Comparison of backbone architectures. Each marker corresponds to a specific model family (ResNet, DeiT, VSSD, ...), and straight lines connect variants belonging to the same architectural family (tiny, small, base, large). The x-axis shows the number of parameters on a logarithmic scale, while the y-axis shows the ImageNet-1K top-1 accuracy reported for each model family in the original papers. The plot hig…

**Figure 2.** Illustration of various scanning strategies used in Mamba models to process visual inputs. Each scan strategy processes image sequences or spatial tokens in a distinct order. This balances computational efficiency, long-range dependency modeling, and fine-grained feature extraction. The scanning approaches shape the flow of information and the receptive field. They ultimately affect the model’s ability to …

**Figure 3.** Architecture of Mamba block [21]. 3.2. Selective State Space Model. While the classical SSM provides a robust framework for analyzing dynamic systems, it operates under the assumption of linear time-invariant (LTI) dynamics. This implies that the parameters 𝐴, 𝐵, and 𝐶 remain constant over time, limiting the model’s flexibility when dealing with complex, non-stationary input signals. To address this limitati…

**Figure 4.** Illustration of representative Visual Mamba blocks, including Vim [85], VSSD [39], SSD [63], and the MambaVision mixer [22]. The figure delineates the architectural distinctions among these variants, showing how each design tailors the Mamba state-space paradigm to visual processing via dedicated token mixing, spatial scanning approaches, and feature refinement techniques. Together, these modules exemplify…

**Figure 5.** Refined Vim [85] architecture for AI-generated image detection. Vim [85] introduces the first pure SSM-based model for vision tasks. The authors highlight two major challenges of applying SSM to vision tasks: modeling uni-directionality and lack of location awareness. Vim incorporates bidirectional SSM and positional embedding techniques to overcome these challenges. As depicted in …

**Figure 6.** Refined MambaVision [22] architecture for AI-generated image detection. … more closely with the non-sequential nature of image data. In addition, a symmetric branch has been incorporated into the mixer design. This branch complements the Mamba-based component’s sequential modeling, focusing on spatial features through the implementation of additional convolutional operations. The outputs from both branches a…

**Figure 7.** Refined VSSD [63] architecture for AI-generated image detection. VSSD [63] introduces a novel approach to applying state space models (SSMs) to vision tasks. An overview of the approach is illustrated in Figure 7.

**Figure 8.** Each sub-figure is a random image from a testing set, labeled below. The 4-digit binary code shows results from ResNet, Xception, DeiT, and BLIP2 models, where '0' means real and '1' means fake. All generated images are considered fake. … reported in the C2P-CLIP paper. Implementation details. In our experiments, we leveraged the PyTorch deep learning framework on a Linux computer equipped with a 16 GB NVIDI…

**Figure 9.** Performance comparison of multiple models on AntifakePrompt and Bedroom datasets. The x-axis shows model size (parameters), and the y-axis represents accuracy. Model families like SSMs, CNNs, attention-based models, and VLMs are compared to evaluate efficiency and effectiveness trade-offs. … smallest model, achieves near-perfect results on real images (99.97%) and LDM (96.93%) but performs poorly on ADM (00.…

**Figure 10.** Performance comparison of models on the UniversalFakeDetect dataset, including frequency-based (Freqspec, Co-occurrence), convolutional (CNN-Spot, F3Net), transformer-based (FatFormer), SSM-based (Vim, VSSD), hybrid (MambaVision, Bi-LORA, LGrad), and multimodal (AntifakePrompt, Bi-LORA, C2P-CLIP) architectures. … between real and AI-generated images. This can be attributed to their architectural advantages…

**Figure 11.** t-SNE Visualization of Feature Distributions of Vim Model. The scatter plots illustrate the t-SNE embeddings of features extracted from real (green) and generated (red) images across various generative models, showing how well the features separate real from fake images.

**Figure 12.** t-SNE Visualization of Feature Distributions of VSSD Model. The scatter plots illustrate the t-SNE embeddings of features extracted from real (green) and generated (red) images across various generative models, showing how well the features separate real from fake images. … features necessary to differentiate between real and synthetic images. Furthermore, the inability of Vision Mamba models to effectively…

**Figure 13.** t-SNE Visualization of Feature Distributions of MambaVision Model. The scatter plots illustrate the t-SNE embeddings of features extracted from real (green) and generated (red) images across various generative models, showing how well the features separate real from fake images. … relationships. To overcome these limitations, some approaches introduce multiple scanning techniques to expand the receptive fie…

Abstract

In recent years, computer vision has witnessed remarkable progress, fueled by the development of innovative architectures such as Convolutional Neural Networks (CNNs), Generative Adversarial Networks (GANs), diffusion-based architectures, Vision Transformers (ViTs), and, more recently, Vision-Language Models (VLMs). This progress has undeniably contributed to creating increasingly realistic and diverse visual content. However, such advancements in image generation also raise concerns about potential misuse in areas such as misinformation, identity theft, and threats to privacy and security. In parallel, Mamba-based architectures have emerged as versatile tools for a range of image analysis tasks, including classification, segmentation, medical imaging, object detection, and image restoration, in this rapidly evolving field. However, their potential for identifying AI-generated images remains relatively unexplored compared to established techniques. This study provides a systematic evaluation and comparative analysis of Vision Mamba models for AI-generated image detection. We benchmark multiple Vision Mamba variants against representative CNNs, ViTs, and VLM-based detectors across diverse datasets and synthetic image sources, focusing on key metrics such as accuracy, efficiency, and generalizability across diverse image types and generative models. Through this comprehensive analysis, we aim to elucidate Vision Mamba's strengths and limitations relative to established methodologies in terms of applicability, accuracy, and efficiency in detecting AI-generated images. Overall, our findings highlight both the promise and current limitations of Vision Mamba as a component in systems designed to distinguish authentic from AI-generated visual content. This research is crucial for enhancing detection in an age where distinguishing between real and AI-generated content is a major challenge.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

A solid but incremental benchmark paper that applies Vision Mamba to AI-image detection and reports mixed practical results.

The main takeaway is that Vision Mamba shows some promise for spotting AI-generated images but does not clearly outperform established CNN and ViT detectors on the metrics they tested. The work is a straightforward empirical comparison rather than a new method or theoretical advance.

What the paper does well is run a systematic head-to-head on multiple Mamba variants against representative baselines across several datasets and generative sources. It reports accuracy, efficiency, and some cross-model generalization numbers, which is the kind of practical data people building detectors actually need. The abstract is honest about both strengths and limitations, and the authors avoid overclaiming superiority.

The soft spots are typical for this style of paper. The central claims rest on whatever datasets and splits they chose; if those do not cover newer generators or real-world distribution shifts, the generalizability story weakens. There is no new architecture or loss function, so the novelty is entirely in the application and the breadth of the comparison. Without seeing error bars, statistical tests, or ablation details in the full text, it is hard to judge how robust the rankings are. The weakest assumption is that the chosen benchmarks capture the conditions that matter for deployment.

This paper is for people working on detection pipelines who want to know whether swapping in a Mamba backbone is worth trying. It is not for readers looking for architectural innovation or formal guarantees. The experiments look reproducible enough on the surface to deserve referee time, even if the conclusions are likely to be revised. I would send it out for review rather than desk reject.

Referee Report

2 major / 1 minor

Summary. The manuscript presents a benchmark study evaluating several Vision Mamba variants for the task of distinguishing authentic images from AI-generated ones. It compares these models against representative CNNs, Vision Transformers, and VLM-based detectors across multiple datasets and generative sources, reporting on accuracy, efficiency, and cross-model generalizability, and concludes that Mamba architectures show both promise and current limitations for this application.

Significance. If the empirical comparisons prove robust and reproducible, the work would supply a useful reference point for selecting efficient sequence-modeling backbones in synthetic-media detection pipelines, particularly where computational cost is a concern relative to transformer-based alternatives.

major comments (2)

[Abstract] The abstract states that the study benchmarks 'multiple Vision Mamba variants' and reports 'key metrics such as accuracy, efficiency, and generalizability,' yet no quantitative results, tables, or statistical details (e.g., means, standard deviations, or significance tests) appear in the provided text; without these, the central claim of 'promise and current limitations' cannot be evaluated.
[Abstract / Experimental Setup] The weakest assumption identified—that the chosen datasets and generative models capture real-world generalizability—is load-bearing for the paper's conclusions, but the manuscript supplies no information on data splits, number of runs, or out-of-distribution test sets that would allow readers to assess this assumption.

minor comments (1)

[Introduction] The introduction lists recent architectures but does not cite the original Mamba or Vision Mamba papers; adding these references would improve context.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our benchmark study of Vision Mamba for AI-generated image detection. We address each major comment below and will revise the manuscript to improve clarity and completeness.

Referee: [Abstract] The abstract states that the study benchmarks 'multiple Vision Mamba variants' and reports 'key metrics such as accuracy, efficiency, and generalizability,' yet no quantitative results, tables, or statistical details (e.g., means, standard deviations, or significance tests) appear in the provided text; without these, the central claim of 'promise and current limitations' cannot be evaluated.

Authors: We agree that the abstract would be strengthened by including key quantitative highlights. In the revised version, we will add specific results such as average accuracies (with standard deviations where multiple runs were performed), efficiency comparisons (e.g., FLOPs or inference time), and a brief note on generalizability trends to better support the claims of promise and limitations.

Revision: yes

Referee: [Abstract / Experimental Setup] The weakest assumption identified—that the chosen datasets and generative models capture real-world generalizability—is load-bearing for the paper's conclusions, but the manuscript supplies no information on data splits, number of runs, or out-of-distribution test sets that would allow readers to assess this assumption.

Authors: We acknowledge that explicit details on experimental reproducibility are essential. While the Experimental Setup section describes the datasets and generative sources, we will expand it in revision to include precise train/validation/test splits, the number of independent runs with reported means and standard deviations, and any out-of-distribution evaluations to allow readers to better evaluate the generalizability claims.

Revision: yes
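
The means and standard deviations promised in these responses can be computed from per-run accuracies in a few lines. The sketch below uses only the Python standard library; `summarize_runs` is a hypothetical helper and the accuracy values are toy numbers, not figures from the paper.

```python
import statistics


def summarize_runs(accuracies):
    """Return (mean, sample standard deviation) over independent runs."""
    if len(accuracies) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    mean = statistics.mean(accuracies)
    # statistics.stdev uses the n - 1 (sample) denominator.
    std = statistics.stdev(accuracies)
    return mean, std


if __name__ == "__main__":
    # Toy accuracies from three hypothetical runs with different seeds.
    runs = [0.90, 0.92, 0.94]
    mean, std = summarize_runs(runs)
    print(f"accuracy = {mean:.3f} ± {std:.3f} over {len(runs)} runs")
```

Reporting the number of runs alongside the spread lets a reader judge whether a difference between two backbones is larger than seed-to-seed variation.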

Circularity Check

0 steps flagged

No significant circularity; purely empirical benchmark

The paper performs a systematic empirical comparison of Vision Mamba variants against CNNs, ViTs, and VLMs on multiple datasets for AI-generated image detection. It reports accuracy, efficiency, and generalizability metrics from direct experiments with no equations, derivations, fitted parameters relabeled as predictions, or load-bearing self-citations. The abstract and described scope frame the work as an external benchmark study whose results are falsifiable against held-out data and independent implementations. No self-definitional, ansatz-smuggling, or renaming patterns appear. This matches the default expectation for non-circular empirical papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities are stated beyond standard benchmarking assumptions.

pith-pipeline@v0.9.1-grok · 5855 in / 933 out tokens · 21431 ms · 2026-06-30T21:15:21.674507+00:00 · methodology

Reference graph

Works this paper leans on

86 extracted references · 35 canonical work pages · 13 internal anchors

[1] Adobe, 2023. Create with firefly generative ai. https://www.adobe.com/products/firefly.html. Accessed: 2024-10-10

[2] Brock, A., 2018. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096

[3] Chai, L., Bau, D., Lim, S.N., Isola, P., 2020. What makes fake images detectable? understanding properties that generalize, in: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVI 16, Springer. pp. 103–120

[4] Chang, Y.M., Yeh, C., Chiu, W.C., Yu, N., 2023. Antifakeprompt: Prompt-tuned vision-language models are fake image detectors. arXiv preprint arXiv:2310.17419

[5] Chen, C., Chen, Q., Xu, J., Koltun, V., 2018. Learning to see in the dark, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3291–3300

[6] Chen, Q., Koltun, V., 2017. Photographic image synthesis with cascaded refinement networks, in: Proceedings of the IEEE international conference on computer vision, pp. 1511–1520

[7] Chen, Y., Zhang, L., Niu, Y., Chen, P., Tan, L., Zhou, J., 2024. Guided and fused: Efficient frozen clip-vit with feature guidance and multi-stage feature fusion for generalizable deepfake detection. arXiv preprint arXiv:2408.13697

[8] Choi, Y., Choi, M., Kim, M., Ha, J.W., Kim, S., Choo, J., 2018. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8789–8797

[9] Chollet, F., 2017. Xception: Deep learning with depthwise separable convolutions, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1251–1258

[10] Cozzolino, D., Poggi, G., Corvi, R., Nießner, M., Verdoliva, L., 2023. Raising the bar of ai-generated image detection with clip. arXiv preprint arXiv:2312.00195

[11] Dai, T., Cai, J., Zhang, Y., Xia, S.T., Zhang, L., 2019. Second-order attention network for single image super-resolution, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 11065–11074

[12] Dai, W., Li, J., Li, D., Tiong, A.M.H., Zhao, J., Wang, W., Li, B., Fung, P., Hoi, S., 2023. Instructblip: Towards general-purpose vision-language models with instruction tuning. arXiv:2305.06500
[13] Dao, T., Gu, A., 2024. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality. arXiv preprint arXiv:2405.21060

[14] Darcet, T., Oquab, M., Mairal, J., Bojanowski, P., 2023. Vision transformers need registers. arXiv preprint arXiv:2309.16588

[15] Dhariwal, P., Nichol, A., 2021. Diffusion models beat gans on image synthesis. Advances in neural information processing systems 34, 8780–8794

[16] Ding, M., Zheng, W., Hong, W., Tang, J., 2022. Cogview2: Faster and better text-to-image generation via hierarchical transformers. Advances in Neural Information Processing Systems 35, 16890–16902

[17] Dong, W., Zhu, H., Lin, S., Luo, X., Shen, Y., Liu, X., Zhang, J., Guo, G., Zhang, B., 2024. Fusion-mamba for cross-modality object detection. arXiv preprint arXiv:2404.09146

[18] Dos Santos, R., Aguilar, J., 2024. A synthetic data generation system based on the variational-autoencoder technique and the linked data paradigm. Progress in Artificial Intelligence, 1–15

[19] Dosovitskiy, A., 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929

[20] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y., 2014. Generative adversarial nets. Advances in neural information processing systems 27

[21] Gu, A., Dao, T., 2023. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752

[22] Hatamizadeh, A., Kautz, J., 2025. Mambavision: A hybrid mamba-transformer vision backbone, in: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 25261–25270

[23] He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778
[24] Ho, J., Jain, A., Abbeel, P., 2020. Denoising diffusion probabilistic models. Advances in neural information processing systems 33, 6840–6851

[25] Huang, T., Pei, X., You, S., Wang, F., Qian, C., Xu, C., 2024a. Localmamba: Visual state space model with windowed selective scan. arXiv preprint arXiv:2403.09338

[26] Huang, Z., Xia, B., Lin, Z., Mou, Z., Yang, W., 2024b. Ffaa: Multimodal large language model based explainable open-world face forgery analysis assistant. arXiv preprint arXiv:2408.10072

[27] Iliopoulou, S., Tsinganos, P., Ampeliotis, D., Skodras, A., 2024. Synthetic face discrimination via learned image compression. Algorithms 17, 375

[28] Karageogiou, D., Bammey, Q., Porcellini, V., Goupil, B., Teyssou, D., Papadopoulos, S., 2024. Evolution of detection performance throughout the online lifespan of synthetic images, in: European Conference on Computer Vision, Springer. pp. 400–417

[29] Karras, T., 2017. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196

[30] Karras, T., Aittala, M., Laine, S., Härkönen, E., Hellsten, J., Lehtinen, J., Aila, T., 2021. Alias-free generative adversarial networks. Advances in neural information processing systems 34, 852–863

[31] Karras, T., Laine, S., Aila, T., 2019. A style-based generator architecture for generative adversarial networks, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 4401–4410

[32] Keita, M., Hamidouche, W., Bougueffa, H., Hadid, A., Taleb-Ahmed, A., 2024. Harnessing the power of large vision language models for synthetic image detection. arXiv preprint arXiv:2404.02726

[33] Keita, M., Hamidouche, W., Bougueffa Eutamene, H., Taleb-Ahmed, A., Camacho, D., Hadid, A., 2025. Bi-lora: A vision-language approach for synthetic image detection. Expert Systems 42, e13829

[34] Konstantinidou, D., Koutlis, C., Papadopoulos, S., 2024. Texturecrop: Enhancing synthetic image detection through texture-based cropping. arXiv preprint arXiv:2407.15500

[35] Li, K., Zhang, T., Malik, J., 2019. Diverse image synthesis from semantic layouts via conditional imle, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4220–4229

[36] Li, S., Singh, H., Grover, A., 2024. Mamba-nd: Selective state space modeling for multi-dimensional data. arXiv preprint arXiv:2402.05892

[37] Liu, H., Tan, Z., Tan, C., Wei, Y., Wang, J., Zhao, Y., 2024a. Forgery-aware adaptive transformer for generalizable synthetic image detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10770–10780

[38] Liu, L., Ren, Y., Lin, Z., Zhao, Z., 2022. Pseudo numerical methods for diffusion models on manifolds. arXiv preprint arXiv:2202.09778
[39]

VMamba: Visual State Space Model

Liu, Y., Tian, Y., Zhao, Y., Yu, H., Xie, L., Wang, Y., Ye, Q., Liu, Y., 2024b. Vmamba: Visual state space model. arXiv preprint arXiv:2401.10166

work page internal anchor Pith review Pith/arXiv arXiv
[40]

U-Mamba: Enhancing Long-range Dependency for Biomedical Image Segmentation

Ma, J., Li, F., Wang, B., 2024. U-mamba: Enhancing long-range dependency for biomedical image segmentation. arXiv preprint arXiv:2401.04722

work page internal anchor Pith review Pith/arXiv arXiv 2024
[41]

Midjourney v5.https://www.midjourney.com

MidJourney, 2023. Midjourney v5.https://www.midjourney.com. Accessed: 2024-10-10

2023
[42]

Detecting gan generated fake images using co-occurrence matrices

Nataraj, L., Mohammed, T.M., Chandrasekaran, S., Flenner, A., Bappy, J.H., Roy-Chowdhury, A.K., Manjunath, B., 2019. Detecting gan generated fake images using co-occurrence matrices. arXiv preprint arXiv:1903.06836

work page arXiv 2019
[43]

GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

Nichol,A.,Dhariwal,P.,Ramesh,A.,Shyam,P.,Mishkin,P.,McGrew,B.,Sutskever,I.,Chen,M.,2021. Glide:Towardsphotorealisticimage generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741

work page internal anchor Pith review Pith/arXiv arXiv 2021
[44]

Improved denoising diffusion probabilistic models, in: International Conference on Machine Learning, PMLR

Nichol, A.Q., Dhariwal, P., 2021. Improved denoising diffusion probabilistic models, in: International Conference on Machine Learning, PMLR. pp. 8162–8171

2021
[45]

Towards universal fake image detectors that generalize across generative models, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

Ojha, U., Li, Y., Lee, Y.J., 2023. Towards universal fake image detectors that generalize across generative models, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 24480–24489

2023
[46]

Dall-e 3.https://openai.com/dall-e-3

OpenAI, 2023. Dall-e 3.https://openai.com/dall-e-3. Accessed: 2024-10-10

2023
[47] Paik, S., Bonna, S., Novozhilova, E., Gao, G., Kim, J., Wijaya, D., Betke, M., 2023. The affective nature of ai-generated news images: Impact on visual journalism, in: 2023 11th International Conference on Affective Computing and Intelligent Interaction (ACII), IEEE. pp. 1–8
[48] Park, T., Liu, M.Y., Wang, T.C., Zhu, J.Y., 2019. Gaugan: semantic image synthesis with spatially adaptive normalization, in: ACM SIGGRAPH 2019 Real-Time Live!, pp. 1–1
[49] Patro, B.N., Agneeswaran, V.S., 2024. Simba: Simplified mamba-based architecture for vision and multivariate time series. arXiv preprint arXiv:2403.15360
[50] Pei, X., Huang, T., Xu, C., 2024. Efficientvmamba: Atrous selective scan for light weight visual mamba. arXiv preprint arXiv:2403.09977
[51] Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R., 2023. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952
[52] Qian, Y., Yin, G., Sheng, L., Chen, Z., Shao, J., 2020. Thinking in frequency: Face forgery detection by mining frequency-aware clues, in: European conference on computer vision, Springer. pp. 86–103
[53] Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., Sutskever, I., 2021. Zero-shot text-to-image generation, in: International conference on machine learning, PMLR. pp. 8821–8831
[54] Ren, S., Li, X., Tu, H., Wang, F., Shu, F., Zhang, L., Mei, J., Yang, L., Wang, P., Wang, H., et al., 2024. Autoregressive pretraining with mamba in vision. arXiv preprint arXiv:2406.07537
[55] Ricker, J., Damm, S., Holz, T., Fischer, A., 2022. Towards the detection of diffusion model deepfakes. arXiv preprint arXiv:2210.14571
[56] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B., 2022. High-resolution image synthesis with latent diffusion models, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10684–10695
[57] Rossler, A., Cozzolino, D., Verdoliva, L., Riess, C., Thies, J., Nießner, M., 2019. Faceforensics++: Learning to detect manipulated facial images, in: Proceedings of the IEEE/CVF international conference on computer vision, pp. 1–11
[58] Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., Aberman, K., 2023. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 22500–22510
[59] Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E.L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., et al., 2022. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems 35, 36479–36494
[61] Sauer, A., Schwarz, K., Geiger, A., 2022. Stylegan-xl: Scaling stylegan to large diverse datasets, in: ACM SIGGRAPH 2022 conference proceedings, pp. 1–10
[62] Shi, J., Xiong, W., Lin, Z., Jung, H.J., 2024a. Instantbooth: Personalized text-to-image generation without test-time finetuning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8543–8552
[63] Shi, Y., Dong, M., Xu, C., 2024b. Multi-scale vmamba: Hierarchy in hierarchy visual state space model. arXiv preprint arXiv:2405.14174
[64] Shi, Y., Li, M., Dong, M., Xu, C., 2025. Vssd: Vision mamba with non-causal state space duality, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10819–10829
[65] Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., Ganguli, S., 2015. Deep unsupervised learning using nonequilibrium thermodynamics, in: International conference on machine learning, PMLR. pp. 2256–2265
[66] Sun, G., Hua, Y., Hu, G., Robertson, N., 2021. Mamba: Multi-level aggregation via memory bank for video object detection, in: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 2620–2627
[67] Tan, C., Tao, R., Liu, H., Gu, G., Wu, B., Zhao, Y., Wei, Y., 2025. C2p-clip: Injecting category common prompt in clip to enhance generalization in deepfake detection, in: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 7184–7192
[68] Tan, C., Zhao, Y., Wei, S., Gu, G., Liu, P., Wei, Y., 2024a. Frequency-aware deepfake detection: Improving generalizability through frequency space domain learning, in: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 5052–5060
[69] Tan, C., Zhao, Y., Wei, S., Gu, G., Liu, P., Wei, Y., 2024b. Rethinking the up-sampling operations in cnn-based generative network for generalizable deepfake detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 28130–28139
[70] Tan, C., Zhao, Y., Wei, S., Gu, G., Wei, Y., 2023. Learning on gradients: Generalized artifacts representation for gan-generated images detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12105–12114
[71] Tang, L., Xiao, H., Jiang, P.T., Zhang, H., Chen, J., Li, B., 2024. Scalable visual state space model with fractal scanning. arXiv preprint arXiv:2405.14480
[72] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H., 2021. Training data-efficient image transformers & distillation through attention, in: International conference on machine learning, PMLR. pp. 10347–10357
[73] Vogels, T., Karimireddy, S.P., Jaggi, M., 2019. Powersgd: Practical low-rank gradient compression for distributed optimization. Advances in Neural Information Processing Systems 32
[74] Wang, F., Wang, J., Ren, S., Wei, G., Mei, J., Shao, W., Zhou, Y., Yuille, A., Xie, C., 2024. Mamba-r: Vision mamba also needs registers. arXiv preprint arXiv:2405.14858
[75] Wang, S.Y., Wang, O., Zhang, R., Owens, A., Efros, A.A., 2020. Cnn-generated images are surprisingly easy to spot... for now, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8695–8704
[76] Wang, Z., Bao, J., Zhou, W., Wang, W., Hu, H., Chen, H., Li, H., 2023. Dire for diffusion-generated image detection. arXiv preprint arXiv:2303.09295
[77] Xu, N., Feng, W., Zhang, T., Zhang, Y., 2024. Fd-gan: Generalizable and robust forgery detection via generative adversarial networks. International Journal of Computer Vision, 1–19
[78] Xu, Y., Liang, J., Jia, G., Yang, Z., Zhang, Y., He, R., 2023. Tall: Thumbnail layout for deepfake video detection, in: Proceedings of the IEEE/CVF international conference on computer vision, pp. 22658–22668
[79] Xue, Z., Song, G., Guo, Q., Liu, B., Zong, Z., Liu, Y., Luo, P., 2024. Raphael: Text-to-image generation via large mixture of diffusion paths. Advances in Neural Information Processing Systems 36
[80] Yang, C., Chen, Z., Espinosa, M., Ericsson, L., Wang, Z., Liu, J., Crowley, E.J., 2024. Plainmamba: Improving non-hierarchical mamba in visual recognition. arXiv preprint arXiv:2403.17695
