pith. sign in

arxiv: 2605.14799 · v2 · pith:UVNLUZFNnew · submitted 2026-05-14 · 💻 cs.CV · cs.CR· cs.SI

Can Visual Mamba Improve AI-Generated Image Detection? An In-Depth Investigation

Pith reviewed 2026-06-30 21:15 UTC · model grok-4.3

classification 💻 cs.CV cs.CRcs.SI
keywords AI-generated image detectionVision Mambaimage forensicssynthetic image classificationdeep learning detectorscomputer vision backbonesgenerative model identification
0
0 comments X

The pith

Vision Mamba models exhibit competitive efficiency yet lower accuracy and weaker generalization than CNNs, ViTs, and VLMs when detecting AI-generated images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper conducts a head-to-head benchmark of multiple Vision Mamba variants against CNN, Vision Transformer, and vision-language model detectors on several public datasets containing both real photographs and images produced by GANs and diffusion models. It measures accuracy, inference speed, and how performance holds when the test images come from generators or visual domains not seen during training. A reader would care because scalable, low-cost detectors are needed to flag synthetic content that can spread misinformation or enable fraud. The analysis concludes that Mamba backbones offer a speed advantage but fall short on the core classification task under current training regimes.

Core claim

Vision Mamba architectures, when adapted for binary real-versus-synthetic classification, achieve inference speeds that surpass most transformer baselines while delivering accuracy that remains below the best CNN and VLM detectors; the gap widens on out-of-distribution generators, showing that state-space visual models can contribute to detection pipelines but require additional adaptation to match established methods in reliability.

What carries the argument

Vision Mamba, a selective state-space model backbone for image classification, evaluated here as a drop-in feature extractor for distinguishing authentic from AI-generated images.

If this is right

  • Mamba-based detectors can reduce computational cost in large-scale screening systems that must process millions of images daily.
  • The observed accuracy shortfall implies that pure Mamba pipelines may need supplementary modules such as frequency-domain filters or ensemble heads to reach deployment thresholds.
  • Cross-generator evaluation shows that training on a narrow set of synthetic sources produces brittle detectors, regardless of backbone architecture.
  • Efficiency gains position Vision Mamba as a candidate for on-device or edge-based detection where latency matters more than marginal accuracy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hybrid architectures that replace only the attention layers of a ViT with Mamba blocks could combine the strengths of both without full retraining.
  • The speed advantage may prove decisive in video or live-stream settings where frame-by-frame detection is required.
  • Transfer from Mamba models pretrained on medical or satellite imagery could supply better initial features for the detection task than ImageNet weights alone.

Load-bearing premise

The chosen datasets, generative models, and evaluation metrics sufficiently represent real-world conditions and capture generalizability for AI-generated image detection.

What would settle it

Retraining and testing the same Mamba variants on a fresh dataset of images from a diffusion model released after the paper's experiments, using the identical train-test split protocol, would show whether the reported accuracy gap persists.

Figures

Figures reproduced from arXiv: 2605.14799 by Abdelmalik Taleb-Ahmed, Abdenour Hadid, Hessen Bougueffa Eutamene, Mamadou Keita, Wassim Hamidouche, Xianxun Zhu.

Figure 1
Figure 1. Figure 1: Comparison of backbone architectures.Each marker corresponds to a specific model family (ResNet, DeiT, VSSD, ...), and straight lines connect variants belonging to the same architectural family (tiny, small, base, large). The x-axis shows the number of parameters on a logarithmic scale, while the y-axis shows the ImageNet-1K top-1 accuracy reported for each model family in the original papers. The plot hig… view at source ↗
Figure 2
Figure 2. Figure 2: Illustration of various scanning strategies used in Mamba models to process visual inputs. Each scan strategy processes image sequences or spatial tokens in a distinct order. This balances computational efficiency, long-range dependency modeling, and fine-grained feature extraction. The scanning approaches shape the flow of information and the receptive field. They ultimately affect the model’s ability to … view at source ↗
Figure 3
Figure 3. Figure 3: Architecture of Mamba block [21] 3.2. Selective State Space Model While the classical SSM provides a robust framework for analyzing dynamic systems, it operates under the assumption of linear time-invariant (LTI) dynamics. This implies that the parameters 𝐴, 𝐵, and 𝐶 remain constant over time, limiting the model’s flexibility when dealing with complex, non-stationary input signals. To address this limitati… view at source ↗
Figure 4
Figure 4. Figure 4: Illustration of representative Visual Mamba blocks, including Vim [85], VSSD [39], SSD [63], and the MambaVision mixer [22]. The figure delineates the architectural distinctions among these variants, showing how each design tailors the Mamba state-space paradigm to visual processing via dedicated token mixing, spatial scanning approaches, and feature refinement techniques. Together, these modules exemplify… view at source ↗
Figure 5
Figure 5. Figure 5: Refined Vim [85] architecture for AI-generated image detection. Vim [85] introduces the first pure SSM-based model for vision tasks. The authors highlight two major challenges of applying SSM to vision tasks: modeling uni-directionality and lack of location awareness. Vim incorporates bidirectional SSM and positional embedding techniques to overcome these challenges. As depicted in [PITH_FULL_IMAGE:figure… view at source ↗
Figure 6
Figure 6. Figure 6: Refined MambaVision [22] architecture for AI-generated image detection. more closely with the non-sequential nature of image data. In addition, a symmetric branch has been incorporated into the mixer design. This branch complements the Mamba-based component’s sequential modeling, focusing on spatial features through the implementation of additional convolutional operations. The outputs from both branches a… view at source ↗
Figure 7
Figure 7. Figure 7: Refined VSSD [63] architecture for AI-generated image detection. VSSD [63] introduces a novel approach to applying state space models (SSMs) to vision tasks. An overview of the approach is illustrated in [PITH_FULL_IMAGE:figures/full_fig_p013_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Each sub-figure is a random image from a testing set, labeled below. The 4-digit binary code shows results from ResNet, Xception, DeiT, and BLIP2 models, where ’0’ means real and ’1’ means fake. All generated images are considered fake. reported in the C2P-CLIP paper. Implementation details. In our experiments, we leveraged the PyTorch deep learning framework on a Linux computer equipped with a 16 GB NVIDI… view at source ↗
Figure 9
Figure 9. Figure 9: Performance comparison of multiple models on AntifakePrompt and Bedroom datasets. The x-axis shows model size (parameters), and the y-axis represents accuracy. Model families like SSMs, CNNs, attention-based models, and VLMs are compared to evaluate efficiency and effectiveness trade-offs. smallest model, achieves near-perfect results on real images (99.97%) and LDM (96.93%) but performs poorly on ADM (00.… view at source ↗
Figure 10
Figure 10. Figure 10: Performance comparison of models on the UniversalFakeDetect dataset, including frequency-based (Freq￾spec, Co-occurrence), convolutional (CNN-Spot, F3Net), transformer-based (FatFormer), SSM-based (Vim, VSSD), hybrid (MambaVision, Bi-LORA, LGrad), and multimodal (AntifakePrompt, Bi-LORA, C2P-CLIP) architectures. between real and AI-generated images. This can be attributed to their architectural advantages… view at source ↗
Figure 11
Figure 11. Figure 11: t-SNE Visualization of Feature Distributions of Vim Model. The scatter plots illustrate the t-SNE embeddings of features extracted from real (green) and generated (red) images across various generative models, showing how well the features separate real from fake images [PITH_FULL_IMAGE:figures/full_fig_p020_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: t-SNE Visualization of Feature Distributions of VSSD Model. The scatter plots illustrate the t-SNE embeddings of features extracted from real (green) and generated (red) images across various generative models, showing how well the features separate real from fake images. features necessary to differentiate between real and synthetic images. Furthermore, the inability of Vision Mamba models to effectively… view at source ↗
Figure 13
Figure 13. Figure 13: t-SNE Visualization of Feature Distributions of MambaVision Model. The scatter plots illustrate the t-SNE embeddings of features extracted from real (green) and generated (red) images across various generative models, showing how well the features separate real from fake images. relationships. To overcome these limitations, some approaches introduce multiple scanning techniques to expand the receptive fie… view at source ↗
read the original abstract

In recent years, computer vision has witnessed remarkable progress, fueled by the development of innovative architectures such as Convolutional Neural Networks (CNNs), Generative Adversarial Networks (GANs), diffusion-based architectures, Vision Transformers (ViTs), and, more recently, Vision-Language Models (VLMs). This progress has undeniably contributed to creating increasingly realistic and diverse visual content. However, such advancements in image generation also raise concerns about potential misuse in areas such as misinformation, identity theft, and threats to privacy and security. In parallel, Mamba-based architectures have emerged as versatile tools for a range of image analysis tasks, including classification, segmentation, medical imaging, object detection, and image restoration, in this rapidly evolving field. However, their potential for identifying AI-generated images remains relatively unexplored compared to established techniques. This study provides a systematic evaluation and comparative analysis of Vision Mamba models for AI-generated image detection. We benchmark multiple Vision Mamba variants against representative CNNs, ViTs, and VLM-based detectors across diverse datasets and synthetic image sources, focusing on key metrics such as accuracy, efficiency, and generalizability across diverse image types and generative models. Through this comprehensive analysis, we aim to elucidate Vision Mamba's strengths and limitations relative to established methodologies in terms of applicability, accuracy, and efficiency in detecting AI-generated images. Overall, our findings highlight both the promise and current limitations of Vision Mamba as a component in systems designed to distinguish authentic from AI-generated visual content. This research is crucial for enhancing detection in an age where distinguishing between real and AI-generated content is a major challenge.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents a benchmark study evaluating several Vision Mamba variants for the task of distinguishing authentic images from AI-generated ones. It compares these models against representative CNNs, Vision Transformers, and VLM-based detectors across multiple datasets and generative sources, reporting on accuracy, efficiency, and cross-model generalizability, and concludes that Mamba architectures show both promise and current limitations for this application.

Significance. If the empirical comparisons prove robust and reproducible, the work would supply a useful reference point for selecting efficient sequence-modeling backbones in synthetic-media detection pipelines, particularly where computational cost is a concern relative to transformer-based alternatives.

major comments (2)
  1. [Abstract] The abstract states that the study benchmarks 'multiple Vision Mamba variants' and reports 'key metrics such as accuracy, efficiency, and generalizability,' yet no quantitative results, tables, or statistical details (e.g., means, standard deviations, or significance tests) appear in the provided text; without these, the central claim of 'promise and current limitations' cannot be evaluated.
  2. [Abstract / Experimental Setup] The weakest assumption identified—that the chosen datasets and generative models capture real-world generalizability—is load-bearing for the paper's conclusions, but the manuscript supplies no information on data splits, number of runs, or out-of-distribution test sets that would allow readers to assess this assumption.
minor comments (1)
  1. [Introduction] The introduction lists recent architectures but does not cite the original Mamba or Vision Mamba papers; adding these references would improve context.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our benchmark study of Vision Mamba for AI-generated image detection. We address each major comment below and will revise the manuscript to improve clarity and completeness.

read point-by-point responses
  1. Referee: [Abstract] The abstract states that the study benchmarks 'multiple Vision Mamba variants' and reports 'key metrics such as accuracy, efficiency, and generalizability,' yet no quantitative results, tables, or statistical details (e.g., means, standard deviations, or significance tests) appear in the provided text; without these, the central claim of 'promise and current limitations' cannot be evaluated.

    Authors: We agree that the abstract would be strengthened by including key quantitative highlights. In the revised version, we will add specific results such as average accuracies (with standard deviations where multiple runs were performed), efficiency comparisons (e.g., FLOPs or inference time), and a brief note on generalizability trends to better support the claims of promise and limitations. revision: yes

  2. Referee: [Abstract / Experimental Setup] The weakest assumption identified—that the chosen datasets and generative models capture real-world generalizability—is load-bearing for the paper's conclusions, but the manuscript supplies no information on data splits, number of runs, or out-of-distribution test sets that would allow readers to assess this assumption.

    Authors: We acknowledge that explicit details on experimental reproducibility are essential. While the Experimental Setup section describes the datasets and generative sources, we will expand it in revision to include precise train/validation/test splits, the number of independent runs with reported means and standard deviations, and any out-of-distribution evaluations to allow readers to better evaluate the generalizability claims. revision: yes

Circularity Check

0 steps flagged

No significant circularity; purely empirical benchmark

full rationale

The paper performs a systematic empirical comparison of Vision Mamba variants against CNNs, ViTs, and VLMs on multiple datasets for AI-generated image detection. It reports accuracy, efficiency, and generalizability metrics from direct experiments with no equations, derivations, fitted parameters relabeled as predictions, or load-bearing self-citations. The abstract and described scope frame the work as an external benchmark study whose results are falsifiable against held-out data and independent implementations. No self-definitional, ansatz-smuggling, or renaming patterns appear. This matches the default expectation for non-circular empirical papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities are stated beyond standard benchmarking assumptions.

pith-pipeline@v0.9.1-grok · 5855 in / 933 out tokens · 21431 ms · 2026-06-30T21:15:21.674507+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

86 extracted references · 35 canonical work pages · 13 internal anchors

  1. [1]

    Create with firefly generative ai.https://www.adobe.com/products/firefly.html

    Adobe, 2023. Create with firefly generative ai.https://www.adobe.com/products/firefly.html. Accessed: 2024-10-10

  2. [2]

    Large Scale GAN Training for High Fidelity Natural Image Synthesis

    Brock, A., 2018. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096

  3. [3]

    Chai, L., Bau, D., Lim, S.N., Isola, P., 2020. What makes fake images detectable? understanding properties that generalize, in: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVI 16, Springer. pp. 103–120

  4. [4]

    Antifakeprompt: Prompt-tuned vision-language models are fake image detectors

    Chang, Y.M., Yeh, C., Chiu, W.C., Yu, N., 2023. Antifakeprompt: Prompt-tuned vision-language models are fake image detectors. arXiv preprint arXiv:2310.17419

  5. [5]

    Learning to see in the dark, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp

    Chen, C., Chen, Q., Xu, J., Koltun, V., 2018. Learning to see in the dark, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3291–3300

  6. [6]

    Photographic image synthesis with cascaded refinement networks, in: Proceedings of the IEEE international conference on computer vision, pp

    Chen, Q., Koltun, V., 2017. Photographic image synthesis with cascaded refinement networks, in: Proceedings of the IEEE international conference on computer vision, pp. 1511–1520

  7. [7]

    Guidedandfused:Efficientfrozenclip-vitwithfeatureguidanceandmulti-stage feature fusion for generalizable deepfake detection

    Chen,Y.,Zhang,L.,Niu,Y.,Chen,P.,Tan,L.,Zhou,J.,2024. Guidedandfused:Efficientfrozenclip-vitwithfeatureguidanceandmulti-stage feature fusion for generalizable deepfake detection. arXiv preprint arXiv:2408.13697

  8. [8]

    Stargan: Unified generative adversarial networks for multi-domain image-to- image translation, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp

    Choi, Y., Choi, M., Kim, M., Ha, J.W., Kim, S., Choo, J., 2018. Stargan: Unified generative adversarial networks for multi-domain image-to- image translation, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8789–8797

  9. [9]

    Xception:Deeplearningwithdepthwiseseparableconvolutions,in:ProceedingsoftheIEEEconferenceoncomputervision and pattern recognition, pp

    Chollet,F.,2017. Xception:Deeplearningwithdepthwiseseparableconvolutions,in:ProceedingsoftheIEEEconferenceoncomputervision and pattern recognition, pp. 1251–1258

  10. [10]

    arXiv:2312.00195

    Cozzolino,D.,Poggi,G.,Corvi,R.,Nießner,M.,Verdoliva,L.,2023. Raisingthebarofai-generatedimagedetectionwithclip. arXivpreprint arXiv:2312.00195

  11. [11]

    Second-order attention network for single image super-resolution, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp

    Dai, T., Cai, J., Zhang, Y., Xia, S.T., Zhang, L., 2019. Second-order attention network for single image super-resolution, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 11065–11074

  12. [12]

    InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

    Dai,W.,Li,J.,Li,D.,Tiong,A.M.H.,Zhao,J.,Wang,W.,Li,B.,Fung,P.,Hoi,S.,2023.Instructblip:Towardsgeneral-purposevision-language models with instruction tuning.arXiv:2305.06500. : Preprint submitted to Elsevier Page 21 of 24

  13. [13]

    Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

    Dao, T., Gu, A., 2024. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality. arXiv preprint arXiv:2405.21060

  14. [14]

    Vision Transformers Need Registers

    Darcet, T., Oquab, M., Mairal, J., Bojanowski, P., 2023. Vision transformers need registers. arXiv preprint arXiv:2309.16588

  15. [15]

    Diffusion models beat gans on image synthesis

    Dhariwal, P., Nichol, A., 2021. Diffusion models beat gans on image synthesis. Advances in neural information processing systems 34, 8780–8794

  16. [16]

    Cogview2:Fasterandbettertext-to-imagegenerationviahierarchicaltransformers

    Ding,M.,Zheng,W.,Hong,W.,Tang,J.,2022. Cogview2:Fasterandbettertext-to-imagegenerationviahierarchicaltransformers. Advances in Neural Information Processing Systems 35, 16890–16902

  17. [17]

    Fusion-mambaforcross-modalityobjectdetection

    Dong,W.,Zhu,H.,Lin,S.,Luo,X.,Shen,Y.,Liu,X.,Zhang,J.,Guo,G.,Zhang,B.,2024. Fusion-mambaforcross-modalityobjectdetection. arXiv preprint arXiv:2404.09146

  18. [18]

    A synthetic data generation system based on the variational-autoencoder technique and the linked data paradigm

    Dos Santos, R., Aguilar, J., 2024. A synthetic data generation system based on the variational-autoencoder technique and the linked data paradigm. Progress in Artificial Intelligence , 1–15

  19. [19]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Dosovitskiy, A., 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929

  20. [20]

    Generativeadversarialnets

    Goodfellow,I.,Pouget-Abadie,J.,Mirza,M.,Xu,B.,Warde-Farley,D.,Ozair,S.,Courville,A.,Bengio,Y.,2014. Generativeadversarialnets. Advances in neural information processing systems 27

  21. [21]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Gu, A., Dao, T., 2023. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752

  22. [22]

    Mambavision: Ahybridmamba-transformer visionbackbone,in: Proceedingsofthe ComputerVisionand Pattern Recognition Conference, pp

    Hatamizadeh,A., Kautz,J.,2025. Mambavision: Ahybridmamba-transformer visionbackbone,in: Proceedingsofthe ComputerVisionand Pattern Recognition Conference, pp. 25261–25270

  23. [23]

    Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp

    He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778

  24. [24]

    Ho,J.,Jain,A.,Abbeel,P.,2020.Denoisingdiffusionprobabilisticmodels.Advancesinneuralinformationprocessingsystems33,6840–6851

  25. [25]

    Localmamba: Visual state space model with windowed selective scan

    Huang, T., Pei, X., You, S., Wang, F., Qian, C., Xu, C., 2024a. Localmamba: Visual state space model with windowed selective scan. arXiv preprint arXiv:2403.09338

  26. [26]

    Ffaa: Multimodal large language model based explainable open-world face forgery analysis assistant

    Huang, Z., Xia, B., Lin, Z., Mou, Z., Yang, W., 2024b. Ffaa: Multimodal large language model based explainable open-world face forgery analysis assistant. arXiv preprint arXiv:2408.10072

  27. [27]

    Synthetic face discrimination via learned image compression

    Iliopoulou, S., Tsinganos, P., Ampeliotis, D., Skodras, A., 2024. Synthetic face discrimination via learned image compression. Algorithms 17, 375

  28. [28]

    Evolutionofdetectionperformancethroughout the online lifespan of synthetic images, in: European Conference on Computer Vision, Springer

    Karageogiou,D.,Bammey,Q.,Porcellini,V.,Goupil,B.,Teyssou,D.,Papadopoulos,S.,2024. Evolutionofdetectionperformancethroughout the online lifespan of synthetic images, in: European Conference on Computer Vision, Springer. pp. 400–417

  29. [29]

    Progressive Growing of GANs for Improved Quality, Stability, and Variation

    Karras, T., 2017. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196

  30. [30]

    Alias-freegenerativeadversarialnetworks

    Karras,T.,Aittala,M.,Laine,S.,Härkönen,E.,Hellsten,J.,Lehtinen,J.,Aila,T.,2021. Alias-freegenerativeadversarialnetworks. Advances in neural information processing systems 34, 852–863

  31. [31]

    Astyle-basedgeneratorarchitectureforgenerativeadversarialnetworks,in:ProceedingsoftheIEEE/CVF conference on computer vision and pattern recognition, pp

    Karras,T.,Laine,S.,Aila,T.,2019. Astyle-basedgeneratorarchitectureforgenerativeadversarialnetworks,in:ProceedingsoftheIEEE/CVF conference on computer vision and pattern recognition, pp. 4401–4410

  32. [32]

    Harnessing the power of large vision language models for synthetic image detection

    Keita, M., Hamidouche, W., Bougueffa, H., Hadid, A., Taleb-Ahmed, A., 2024. Harnessing the power of large vision language models for synthetic image detection. arXiv preprint arXiv:2404.02726

  33. [33]

    Bi-lora:Avision-languageapproach for synthetic image detection

    Keita,M.,Hamidouche,W.,BougueffaEutamene,H.,Taleb-Ahmed,A.,Camacho,D.,Hadid,A.,2025. Bi-lora:Avision-languageapproach for synthetic image detection. Expert Systems 42, e13829

  34. [34]

    Texturecrop: Enhancing synthetic image detection through texture-based cropping

    Konstantinidou, D., Koutlis, C., Papadopoulos, S., 2024. Texturecrop: Enhancing synthetic image detection through texture-based cropping. arXiv preprint arXiv:2407.15500

  35. [35]

    Diverse image synthesis from semantic layouts via conditional imle, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp

    Li, K., Zhang, T., Malik, J., 2019. Diverse image synthesis from semantic layouts via conditional imle, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4220–4229

  36. [36]

    Mamba-nd:Selectivestatespacemodelingformulti-dimensionaldata

    Li,S.,Singh,H.,Grover,A.,2024. Mamba-nd:Selectivestatespacemodelingformulti-dimensionaldata. arXivpreprintarXiv:2402.05892

  37. [37]

    Forgery-awareadaptivetransformerforgeneralizablesyntheticimagedetection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

    Liu,H.,Tan,Z.,Tan,C.,Wei,Y.,Wang,J.,Zhao,Y.,2024a. Forgery-awareadaptivetransformerforgeneralizablesyntheticimagedetection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10770–10780

  38. [38]

    arXiv preprint arXiv:2202.09778 (2022)

    Liu, L., Ren, Y., Lin, Z., Zhao, Z., 2022. Pseudo numerical methods for diffusion models on manifolds. arXiv preprint arXiv:2202.09778

  39. [39]

    VMamba: Visual State Space Model

    Liu, Y., Tian, Y., Zhao, Y., Yu, H., Xie, L., Wang, Y., Ye, Q., Liu, Y., 2024b. Vmamba: Visual state space model. arXiv preprint arXiv:2401.10166

  40. [40]

    U-Mamba: Enhancing Long-range Dependency for Biomedical Image Segmentation

    Ma, J., Li, F., Wang, B., 2024. U-mamba: Enhancing long-range dependency for biomedical image segmentation. arXiv preprint arXiv:2401.04722

  41. [41]

    Midjourney v5.https://www.midjourney.com

    MidJourney, 2023. Midjourney v5.https://www.midjourney.com. Accessed: 2024-10-10

  42. [42]

    Detecting gan generated fake images using co-occurrence matrices

    Nataraj, L., Mohammed, T.M., Chandrasekaran, S., Flenner, A., Bappy, J.H., Roy-Chowdhury, A.K., Manjunath, B., 2019. Detecting gan generated fake images using co-occurrence matrices. arXiv preprint arXiv:1903.06836

  43. [43]

    GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

    Nichol,A.,Dhariwal,P.,Ramesh,A.,Shyam,P.,Mishkin,P.,McGrew,B.,Sutskever,I.,Chen,M.,2021. Glide:Towardsphotorealisticimage generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741

  44. [44]

    Improved denoising diffusion probabilistic models, in: International Conference on Machine Learning, PMLR

    Nichol, A.Q., Dhariwal, P., 2021. Improved denoising diffusion probabilistic models, in: International Conference on Machine Learning, PMLR. pp. 8162–8171

  45. [45]

    Towards universal fake image detectors that generalize across generative models, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

    Ojha, U., Li, Y., Lee, Y.J., 2023. Towards universal fake image detectors that generalize across generative models, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 24480–24489

  46. [46]

    Dall-e 3.https://openai.com/dall-e-3

    OpenAI, 2023. Dall-e 3.https://openai.com/dall-e-3. Accessed: 2024-10-10

  47. [47]

    Theaffectivenatureofai-generatednewsimages:Impact on visual journalism, in: 2023 11th International Conference on Affective Computing and Intelligent Interaction (ACII), IEEE

    Paik,S.,Bonna,S.,Novozhilova,E.,Gao,G.,Kim,J.,Wijaya,D.,Betke,M.,2023. Theaffectivenatureofai-generatednewsimages:Impact on visual journalism, in: 2023 11th International Conference on Affective Computing and Intelligent Interaction (ACII), IEEE. pp. 1–8

  48. [48]

    Gaugan: semantic image synthesis with spatially adaptive normalization, in: ACM SIGGRAPH 2019 Real-Time Live!, pp

    Park, T., Liu, M.Y., Wang, T.C., Zhu, J.Y., 2019. Gaugan: semantic image synthesis with spatially adaptive normalization, in: ACM SIGGRAPH 2019 Real-Time Live!, pp. 1–1. : Preprint submitted to Elsevier Page 22 of 24

  49. [49]

    Simba: Simplified mamba-based architecture for vision and multivariate time series

    Patro, B.N., Agneeswaran, V.S., 2024. Simba: Simplified mamba-based architecture for vision and multivariate time series. arXiv preprint arXiv:2403.15360

  50. [50]

    Efficientvmamba: Atrous selective scan for light weight visual mamba

    Pei, X., Huang, T., Xu, C., 2024. Efficientvmamba: Atrous selective scan for light weight visual mamba. arXiv preprint arXiv:2403.09977

  51. [51]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R., 2023. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952

  52. [52]

    Thinking in frequency: Face forgery detection by mining frequency-aware clues, in: European conference on computer vision, Springer

    Qian, Y., Yin, G., Sheng, L., Chen, Z., Shao, J., 2020. Thinking in frequency: Face forgery detection by mining frequency-aware clues, in: European conference on computer vision, Springer. pp. 86–103

  53. [53]

    Zero-shot text-to-image generation, in: International conference on machine learning, Pmlr

    Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., Sutskever, I., 2021. Zero-shot text-to-image generation, in: International conference on machine learning, Pmlr. pp. 8821–8831

  54. [54]

    Autoregressive pretraining with mamba in vision

    Ren, S., Li, X., Tu, H., Wang, F., Shu, F., Zhang, L., Mei, J., Yang, L., Wang, P., Wang, H., et al., 2024. Autoregressive pretraining with mamba in vision. arXiv preprint arXiv:2406.07537

  55. [55]

    arXiv:2210.14571

    Ricker, J., Damm, S., Holz, T., Fischer, A., 2022. Towards the detection of diffusion model deepfakes. arXiv preprint arXiv:2210.14571

  56. [56]

    High-resolution image synthesis with latent diffusion models, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp

    Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B., 2022. High-resolution image synthesis with latent diffusion models, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10684–10695

  57. [57]

    Faceforensics++: Learning to detect manipulated facial images, in: Proceedings of the IEEE/CVF international conference on computer vision, pp

    Rossler, A., Cozzolino, D., Verdoliva, L., Riess, C., Thies, J., Nießner, M., 2019. Faceforensics++: Learning to detect manipulated facial images, in: Proceedings of the IEEE/CVF international conference on computer vision, pp. 1–11

  58. [58]

    Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp

    Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., Aberman, K., 2023. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 22500–22510

  59. [59]

    Saharia,C.,Chan,W.,Saxena,S.,Li,L.,Whang,J.,Denton,E.L.,Ghasemipour,K.,GontijoLopes,R.,KaragolAyan,B.,Salimans,T.,etal.,

  60. [60]

    Advances in neural information processing systems 35, 36479–36494

    Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems 35, 36479–36494

  61. [61]

    Stylegan-xl: Scaling stylegan to large diverse datasets, in: ACM SIGGRAPH 2022 conference proceedings, pp

    Sauer, A., Schwarz, K., Geiger, A., 2022. Stylegan-xl: Scaling stylegan to large diverse datasets, in: ACM SIGGRAPH 2022 conference proceedings, pp. 1–10

  62. [62]

    Instantbooth:Personalizedtext-to-imagegenerationwithouttest-timefinetuning,in:Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

    Shi,J.,Xiong,W.,Lin,Z.,Jung,H.J.,2024a. Instantbooth:Personalizedtext-to-imagegenerationwithouttest-timefinetuning,in:Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8543–8552

  63. [63]

    Multi-scale vmamba: Hierarchy in hierarchy visual state space model

    Shi, Y., Dong, M., Xu, C., 2024b. Multi-scale vmamba: Hierarchy in hierarchy visual state space model. arXiv preprint arXiv:2405.14174

  64. [64]

    10819–10829

    Shi,Y.,Li,M.,Dong,M.,Xu,C.,2025.Vssd:Visionmambawithnon-causalstatespaceduality,in:ProceedingsoftheIEEE/CVFInternational Conference on Computer Vision, pp. 10819–10829

  65. [65]

    Deep unsupervised learning using nonequilibrium thermodynamics, in: International conference on machine learning, PMLR

    Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., Ganguli, S., 2015. Deep unsupervised learning using nonequilibrium thermodynamics, in: International conference on machine learning, PMLR. pp. 2256–2265

  66. [66]

    Mamba: Multi-level aggregation via memory bank for video object detection, in: Proceedings of the AAAI Conference on Artificial Intelligence, pp

    Sun, G., Hua, Y., Hu, G., Robertson, N., 2021. Mamba: Multi-level aggregation via memory bank for video object detection, in: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 2620–2627

  67. [67]

    7184–7192

    Tan,C.,Tao,R.,Liu,H.,Gu,G.,Wu,B.,Zhao,Y.,Wei,Y.,2025.C2p-clip:Injectingcategorycommonpromptincliptoenhancegeneralization in deepfake detection, in: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 7184–7192

  68. [68]

    Frequency-awaredeepfakedetection:Improvinggeneralizabilitythroughfrequency space domain learning, in: Proceedings of the AAAI Conference on Artificial Intelligence, pp

    Tan,C.,Zhao,Y.,Wei,S.,Gu,G.,Liu,P.,Wei,Y.,2024a. Frequency-awaredeepfakedetection:Improvinggeneralizabilitythroughfrequency space domain learning, in: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 5052–5060

  69. [69]

    Tan, C., Zhao, Y., Wei, S., Gu, G., Liu, P., Wei, Y., 2024b. Rethinking the up-sampling operations in cnn-based generative network for generalizable deepfake detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 28130– 28139

  70. [70]

    Tan, C., Zhao, Y., Wei, S., Gu, G., Wei, Y., 2023. Learning on gradients: Generalized artifacts representation for gan-generated images detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12105–12114

  71. [71]

    Scalable visual state space model with fractal scanning

    Tang, L., Xiao, H., Jiang, P.T., Zhang, H., Chen, J., Li, B., 2024. Scalable visual state space model with fractal scanning. arXiv preprint arXiv:2405.14480

  72. [72]

    Training data-efficient image transformers & distillation through attention, in: International conference on machine learning, PMLR

    Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H., 2021. Training data-efficient image transformers & distillation through attention, in: International conference on machine learning, PMLR. pp. 10347–10357

  73. [73]

    Powersgd: Practical low-rank gradient compression for distributed optimization

    Vogels, T., Karimireddy, S.P., Jaggi, M., 2019. Powersgd: Practical low-rank gradient compression for distributed optimization. Advances in Neural Information Processing Systems 32

  74. [74]

    Mamba-r: Vision mamba also needs registers

    Wang, F., Wang, J., Ren, S., Wei, G., Mei, J., Shao, W., Zhou, Y., Yuille, A., Xie, C., 2024. Mamba-r: Vision mamba also needs registers. arXiv preprint arXiv:2405.14858

  75. [75]

    Cnn-generatedimagesaresurprisinglyeasytospot...fornow,in:Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp

    Wang,S.Y.,Wang,O.,Zhang,R.,Owens,A.,Efros,A.A.,2020. Cnn-generatedimagesaresurprisinglyeasytospot...fornow,in:Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8695–8704

  76. [76]

    Dire for diffusion-generated image detection

    Wang, Z., Bao, J., Zhou, W., Wang, W., Hu, H., Chen, H., Li, H., 2023. Dire for diffusion-generated image detection. arXiv preprint arXiv:2303.09295

  77. [77]

    Fd-gan: Generalizable and robust forgery detection via generative adversarial networks

    Xu, N., Feng, W., Zhang, T., Zhang, Y., 2024. Fd-gan: Generalizable and robust forgery detection via generative adversarial networks. International Journal of Computer Vision , 1–19

  78. [78]

    Tall: Thumbnail layout for deepfake video detection, in: Proceedings of the IEEE/CVF international conference on computer vision, pp

    Xu, Y., Liang, J., Jia, G., Yang, Z., Zhang, Y., He, R., 2023. Tall: Thumbnail layout for deepfake video detection, in: Proceedings of the IEEE/CVF international conference on computer vision, pp. 22658–22668

  79. [79]

    Raphael: Text-to-image generation via large mixture of diffusion paths

    Xue, Z., Song, G., Guo, Q., Liu, B., Zong, Z., Liu, Y., Luo, P., 2024. Raphael: Text-to-image generation via large mixture of diffusion paths. Advances in Neural Information Processing Systems 36

  80. [80]

    Plainmamba: Improving non-hierarchical mamba in visual recognition

    Yang, C., Chen, Z., Espinosa, M., Ericsson, L., Wang, Z., Liu, J., Crowley, E.J., 2024. Plainmamba: Improving non-hierarchical mamba in visual recognition. arXiv preprint arXiv:2403.17695

Showing first 80 references.