arxiv: 2511.12834 · v2 · submitted 2025-11-16 · 💻 cs.CV · cs.AI

SAGA: Source Attribution of Generative AI Videos

Pith reviewed 2026-05-17 21:28 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords source attribution · generative AI videos · video forensics · temporal attention signatures · data-efficient learning · synthetic video provenance · multi-granular attribution

The pith

SAGA attributes generative AI videos to their exact source model using only 0.5 percent labeled data per class.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SAGA to move beyond binary real-versus-fake detection by identifying the precise generative model that produced a synthetic video. It does so at five levels of granularity: whether the video is authentic, which task created it, which model version, which team developed it, and which generator was used. The method rests on a video transformer that extracts spatio-temporal artifacts from the features of a robust vision foundation model, and on a pretrain-and-attribute strategy that reaches state-of-the-art accuracy with only 0.5 percent of the source-labeled data per class. It also supplies Temporal Attention Signatures that visualize the temporal patterns distinguishing one generator from another. A reader would care because hyper-realistic synthetic videos already outpace simple detectors and now require traceable provenance for forensic and regulatory use.
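The five levels can be pictured as one label record per video that is projected onto whichever granularity is being trained or evaluated. The sketch below is illustrative only: the level tags BIN-L, TASK-L, SD-L, TEAM-L, and GEN-L come from the figure captions, the pairing of SD-L with the model-version level is an inference, and `SourceLabel`, `targets`, and every generator, team, and backbone name are hypothetical placeholders rather than anything taken from the paper.

```python
from dataclasses import dataclass

# Level tags as they appear in the figure captions.
LEVELS = ("BIN-L", "TASK-L", "SD-L", "TEAM-L", "GEN-L")


@dataclass(frozen=True)
class SourceLabel:
    """Hypothetical per-video label record covering all five levels."""
    generator: str  # exact generator (GEN-L)
    team: str       # development team (TEAM-L)
    version: str    # model version / backbone (SD-L, assumed mapping)
    task: str       # generation task such as "T2V" or "I2V" (TASK-L)

    def at(self, level: str) -> str:
        """Project the record onto one attribution level."""
        if level == "BIN-L":
            return "real" if self.generator == "real" else "fake"
        return {
            "TASK-L": self.task,
            "SD-L": self.version,
            "TEAM-L": self.team,
            "GEN-L": self.generator,
        }[level]


def targets(label: SourceLabel) -> dict:
    """Return the training target for every level at once."""
    return {level: label.at(level) for level in LEVELS}


# Placeholder sources, not real generators.
REAL = SourceLabel("real", "real", "real", "real")
GEN_A = SourceLabel(generator="gen_a", team="team_x", version="backbone_v1", task="T2V")
GEN_B = SourceLabel(generator="gen_b", team="team_x", version="backbone_v2", task="I2V")
```

Under this layout two generators from the same team share a TEAM-L target while remaining distinct at GEN-L, which is the kind of coarse-to-fine relationship the five levels describe.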

Core claim

SAGA is the first framework for multi-granular source attribution of generative AI videos across authenticity, generation task such as text-to-video or image-to-video, model version, development team, and the exact generator. Its video transformer architecture extracts distinguishing spatio-temporal artifacts from features of a robust vision foundation model, while a data-efficient pretrain-and-attribute strategy achieves state-of-the-art performance with only 0.5 percent source-labeled data per class and matches fully supervised results. Temporal Attention Signatures provide the first visual explanation of why different video generators remain distinguishable by highlighting learned timing,

What carries the argument

The data-efficient pretrain-and-attribute strategy combined with Temporal Attention Signatures inside a video transformer that processes features from a robust vision foundation model to isolate stable spatio-temporal artifacts.

Load-bearing premise

Spatio-temporal artifacts extracted from a robust vision foundation model stay unique, stable, and transferable enough across generators and domains to support accurate attribution even when labeled data is reduced to 0.5 percent per class.
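To make the 0.5 percent figure concrete, here is a minimal sketch of drawing a class-stratified labeled subset. It assumes the fraction is applied independently within each class and that every class keeps at least one labeled example; the function name `sample_labeled_subset`, the rounding rule, and the `min_per_class` floor are assumptions for illustration, not details from the paper.

```python
import math
import random
from collections import defaultdict


def sample_labeled_subset(labels, fraction=0.005, seed=0, min_per_class=1):
    """Return sorted indices of a per-class random subset of the labels.

    fraction=0.005 corresponds to 0.5 percent of each class.
    """
    rng = random.Random(seed)

    # Group sample indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    chosen = []
    for label in sorted(by_class):
        pool = by_class[label]
        # Round down, but never below the floor or above the pool size.
        k = max(min_per_class, math.floor(len(pool) * fraction))
        k = min(k, len(pool))
        chosen.extend(rng.sample(pool, k))
    return sorted(chosen)
```

With 1,000 videos in a class this keeps 5 labeled examples, which shows how little supervision the attribution stage would receive under the stated budget.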

What would settle it

Apply SAGA to videos produced by a new generator unseen during training and measure whether attribution accuracy drops well below the fully supervised baseline.

Figures

Figures reproduced from arXiv: 2511.12834 by Amit K. Roy-Chowdhury, Athula Balachandran, Hao Xiong, Rohit Kundu, Shan Jia, Vishal Mohanty.

Figure 1: SAGA: Data-Efficient & Interpretable AI Video Source Attribution. (a) Temporal Attention Signatures (T-Sigs): SAGA pioneers AI video source attribution. Our novel T-Sigs provide interpretability, showing unique fingerprints for Real, Seen, and even Unseen generators. (b) Feature Separability: t-SNE visualization of learned features demonstrates clear generator clusters. (c) Multi-Granular Performance & Dat…
Figure 2: Overall framework of SAGA with a two-stage training approach. In Stage-1, each video xk with real/fake labels is processed through a frozen foundational vision encoder to extract image-level features zm, which are stacked in temporal order to form the video representation ζk. Positional encoding is added, and the sequence is passed through our video transformer architecture θ (Sec. 3.1) to obtain ϕk. The c…
Figure 3: HNM enables better separation boundaries between classes while semi-HNM will exclude these samples from the loss. This focuses the model on the most challenging negatives within the batch. In our source attribution task, some generators produced embeddings with overlapping t-SNE clusters when trained with CE-loss alone (…
Figure 4: t-SNE visualization of SAGA’s learned representations trained on the TASK-L, BIN-L, SD-L and TEAM-L attribution tasks, respectively. Even when supervised at coarser levels, SAGA distinctly clusters individual generators, revealing strong fine-grained discriminative ability.
Figure 5: t-SNE visualization of SAGA on the GEN-L attribution task with different loss functions.
Figure 6: T-Sigs for classes in the different attribution levels. This level of separation indicates that the model is sensitive to subtle distributional differences introduced by specific generator architectures or research teams, enabling it to infer whether an unknown generator shares an SD backbone or team affiliation, or represents a completely novel source. The t-SNE analysis for the GEN-L attribution …
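The contrast drawn in the Figure 3 caption can be sketched with a batch-mined triplet loss, reading HNM as hard negative mining. This is a generic illustration under stated assumptions and not the paper's loss: the Euclidean distance, the margin of 0.2, the use of the farthest positive per anchor, and the function names `dist` and `triplet_loss` are all choices made here for the example.

```python
import math


def dist(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(embeddings, labels, margin=0.2, mode="hard"):
    """Mean triplet loss with negatives mined inside the batch.

    mode="hard": each anchor uses its closest negative, however close.
    mode="semi": only negatives farther than the positive but still inside
                 the margin are kept, so the very hardest negatives are
                 excluded from the loss.
    """
    losses = []
    n = len(embeddings)
    for i in range(n):
        pos = [dist(embeddings[i], embeddings[j])
               for j in range(n) if j != i and labels[j] == labels[i]]
        neg = [dist(embeddings[i], embeddings[j])
               for j in range(n) if labels[j] != labels[i]]
        if not pos or not neg:
            continue  # anchor has no valid triplet in this batch
        d_pos = max(pos)  # farthest same-class sample
        if mode == "semi":
            neg = [d for d in neg if d_pos < d < d_pos + margin]
            if not neg:
                continue  # no semi-hard negative for this anchor
        d_neg = min(neg)  # closest remaining negative
        losses.append(max(0.0, d_pos - d_neg + margin))
    return sum(losses) / len(losses) if losses else 0.0
```

In the hard mode, an anchor whose nearest negative sits closer than its own positive still contributes a large loss term, which is the pressure that pulls overlapping clusters apart; the semi-hard mode drops exactly those cases.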
Original abstract

The proliferation of generative AI has led to hyper-realistic synthetic videos, escalating misuse risks and outstripping binary real/fake detectors. We introduce SAGA (Source Attribution of Generative AI videos), the first comprehensive framework to address the urgent need for AI-generated video source attribution at a large scale. Unlike traditional detection, SAGA identifies the specific generative model used. It uniquely provides multi-granular attribution across five levels: authenticity, generation task (e.g., T2V/I2V), model version, development team, and the precise generator, offering far richer forensic insights. Our novel video transformer architecture, leveraging features from a robust vision foundation model, effectively captures spatio-temporal artifacts. Critically, we introduce a data-efficient pretrain-and-attribute strategy, enabling SAGA to achieve state-of-the-art attribution using only 0.5% of source-labeled data per class, matching fully supervised performance. Furthermore, we propose Temporal Attention Signatures (T-Sigs), a novel interpretability method that visualizes learned temporal differences, offering the first explanation for why different video generators are distinguishable. Extensive experiments on public datasets, including cross-domain scenarios, demonstrate that SAGA sets a new benchmark for synthetic video provenance, providing crucial, interpretable insights for forensic and regulatory applications.
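The data flow described in the abstract and in the Figure 2 caption (per-frame features stacked in temporal order, positional encoding added, then a transformer over the frame sequence) can be sketched in a few lines. Everything specific below is an assumption made for illustration: a single attention head with identity projections stands in for the video transformer θ, the sinusoidal encoding is the standard one rather than necessarily the paper's, mean pooling stands in for ϕk, and the per-frame "signature" (average attention each frame receives) is only one plausible reading of a temporal attention signature, not the paper's definition of T-Sigs.

```python
import math


def positional_encoding(num_frames, dim):
    """Standard sinusoidal positional encoding, one row per frame."""
    pe = []
    for t in range(num_frames):
        row = []
        for i in range(dim):
            angle = t / (10000 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(frame_features):
    """Toy single-head self-attention over a sequence of frame features.

    Returns (pooled, signature): a mean-pooled video embedding and the
    average attention weight each frame receives from all frames.
    """
    T, d = len(frame_features), len(frame_features[0])
    pe = positional_encoding(T, d)
    x = [[f + p for f, p in zip(frame_features[t], pe[t])] for t in range(T)]

    # Scaled dot-product attention with identity Q/K/V projections (assumed).
    scale = math.sqrt(d)
    attn = [softmax([sum(a * b for a, b in zip(x[q], x[k])) / scale
                     for k in range(T)]) for q in range(T)]
    out = [[sum(attn[q][k] * x[k][j] for k in range(T)) for j in range(d)]
           for q in range(T)]

    pooled = [sum(out[q][j] for q in range(T)) / T for j in range(d)]
    signature = [sum(attn[q][k] for q in range(T)) / T for k in range(T)]
    return pooled, signature
```

Because each attention row sums to one, the signature is a distribution over frames, so it can be plotted per class to compare where in time different sources draw attention.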

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces SAGA, a framework for multi-granular source attribution of generative AI videos across five levels (authenticity, generation task such as T2V/I2V, model version, development team, and precise generator). It proposes a video transformer architecture leveraging features from a robust vision foundation model to capture spatio-temporal artifacts, combined with a data-efficient pretrain-and-attribute strategy. The central claims are that this achieves state-of-the-art attribution performance using only 0.5% of source-labeled data per class while matching fully supervised results, and that the novel Temporal Attention Signatures (T-Sigs) provide the first explanation for generator distinguishability. Experiments on public datasets including cross-domain scenarios are said to support these results.

Significance. If the performance and interpretability claims hold after addressing controls for confounds, this would advance AI-generated video forensics beyond binary detection by enabling precise provenance tracking with minimal supervision and offering visual explanations of model-specific artifacts. Such capabilities could support regulatory and forensic applications in a domain where generative video misuse is growing rapidly.

major comments (2)
  1. [Abstract] The claim that SAGA achieves state-of-the-art attribution matching fully supervised performance with only 0.5% source-labeled data per class is stated without quantitative details on baselines, error bars, data splits, or ablation studies. This information is load-bearing for evaluating whether the empirical results support the central data-efficiency claim.
  2. [Abstract, cross-domain scenarios] The reported cross-domain results do not include explicit controls to demonstrate that the spatio-temporal artifacts extracted from the vision foundation model are dominated by stable, generator-specific temporal signatures rather than content statistics, prompt distributions, or video length/resolution cues. Without such controls, the multi-granular attribution performance (including version/team-level) could be undermined by distribution shift, directly affecting the weakest assumption underlying both the pretrain-and-attribute pipeline and T-Sigs.
minor comments (1)
  1. The five attribution granularity levels are listed in the abstract but would benefit from an early table or diagram defining each level with examples to improve clarity for readers.
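One simple form the control requested in major comment 2 could take is a metadata-only baseline: predict the generator from resolution and length alone and see how far that gets. The sketch below is an assumption about how such a check might be run, not an experiment from the paper; `metadata_only_accuracy` and the tuple layout `(metadata, generator_label)` are hypothetical.

```python
from collections import Counter, defaultdict


def metadata_only_accuracy(train, test):
    """Accuracy of a majority-vote classifier that sees only metadata.

    train and test are lists of (metadata, generator_label) pairs, where
    metadata is a hashable tuple such as (resolution, num_frames).
    """
    votes = defaultdict(Counter)
    overall = Counter()
    for meta, label in train:
        votes[meta][label] += 1
        overall[label] += 1

    # Unseen metadata falls back to the most frequent training label.
    fallback = overall.most_common(1)[0][0]

    correct = 0
    for meta, label in test:
        if meta in votes:
            pred = votes[meta].most_common(1)[0][0]
        else:
            pred = fallback
        correct += pred == label
    return correct / len(test)
```

If this baseline already approaches the reported attribution accuracy, resolution or length cues would be a live alternative explanation; if it stays near chance, that particular confound is less of a concern.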

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments, which help us improve the clarity and robustness of our work. We address the major comments point by point below, proposing revisions to the manuscript where necessary.

Point-by-point responses
  1. Referee: [Abstract] The claim that SAGA achieves state-of-the-art attribution matching fully supervised performance with only 0.5% source-labeled data per class is stated without quantitative details on baselines, error bars, data splits, or ablation studies. This information is load-bearing for evaluating whether the empirical results support the central data-efficiency claim.

    Authors: We agree that incorporating quantitative details into the abstract would strengthen the presentation of our central claim. In the revised manuscript, we will update the abstract to include specific metrics, such as the top-1 attribution accuracy with 0.5% labeled data compared to fully supervised baselines, mention the use of standard data splits, and note that error bars and ablation studies are detailed in the experimental sections. This revision will provide the necessary context without exceeding abstract length constraints.

    revision: yes

  2. Referee: [Abstract, cross-domain scenarios] The reported cross-domain results do not include explicit controls to demonstrate that the spatio-temporal artifacts extracted from the vision foundation model are dominated by stable, generator-specific temporal signatures rather than content statistics, prompt distributions, or video length/resolution cues. Without such controls, the multi-granular attribution performance (including version/team-level) could be undermined by distribution shift, directly affecting the weakest assumption underlying both the pretrain-and-attribute pipeline and T-Sigs.

    Authors: We appreciate this important point on potential confounds. Our experiments across public datasets already incorporate variations in content, prompts, and video properties to test generalization. The Temporal Attention Signatures (T-Sigs) are introduced precisely to highlight generator-specific temporal patterns independent of content. To directly address the referee's concern, we will add explicit control experiments in the revision, such as evaluations on content-matched video pairs or ablations removing temporal components, to confirm that the attribution relies on stable generator signatures rather than spurious cues.

    revision: yes
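The "ablations removing temporal components" mentioned in the second response could, for instance, take the form of a frame-shuffle test: destroy temporal order while keeping per-frame content, then compare accuracy. The following is a hedged sketch of that idea only; `classify` stands for any trained attribution model exposed as a callable, and `accuracy`, `shuffled`, and `temporal_ablation_gap` are hypothetical helper names.

```python
import random


def accuracy(classify, videos, labels):
    """Fraction of videos whose predicted label matches the true label."""
    hits = sum(classify(v) == y for v, y in zip(videos, labels))
    return hits / len(videos)


def shuffled(video, rng):
    """Return a copy of the frame sequence in random order."""
    frames = list(video)
    rng.shuffle(frames)
    return frames


def temporal_ablation_gap(classify, videos, labels, seed=0):
    """Compare accuracy on ordered frames against frame-shuffled copies.

    Returns (base, ablated, gap). A gap near zero suggests the classifier
    does not depend on temporal order; a large gap suggests it does.
    """
    rng = random.Random(seed)
    base = accuracy(classify, videos, labels)
    ablated = accuracy(classify, [shuffled(v, rng) for v in videos], labels)
    return base, ablated, base - ablated
```

Shuffling leaves the input videos untouched and changes only frame order, so any accuracy drop can be attributed to lost temporal structure rather than to altered content.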

Circularity Check

0 steps flagged

No circularity: empirical architecture and experimental validation

Full rationale

The paper introduces an empirical video transformer architecture that extracts spatio-temporal features from a vision foundation model, combined with a pretrain-and-attribute training strategy. Central performance claims (SOTA attribution at 0.5% labeled data per class, multi-granular results, and cross-domain generalization) are presented as outcomes of extensive experiments on public datasets rather than as quantities derived by construction from the paper's own equations or definitions. Temporal Attention Signatures are proposed as a post-hoc interpretability visualization of learned temporal differences, with no indication that they reduce to fitted parameters or self-referential inputs. No load-bearing self-citations, uniqueness theorems, or ansatzes smuggled via prior author work are invoked to force the results; the derivation chain remains self-contained through empirical evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so explicit free parameters, axioms, and invented entities cannot be audited in detail. The framework introduces methodological innovations (T-Sigs, pretrain-and-attribute) rather than new physical entities; standard deep-learning assumptions about feature uniqueness are implicit but not enumerated.

pith-pipeline@v0.9.0 · 5542 in / 1168 out tokens · 28814 ms · 2026-05-17T21:28:12.404686+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Who Generated This 3D Asset? Learning Source Attribution for Generative 3D Models

    cs.CV 2026-05 unverdicted novelty 7.0

    Introduces the first passive source attribution benchmark for 22 generative 3D models and a Transformer achieving 97.22% accuracy under full supervision and 77.17% with 1% training data.

  2. Video as Natural Augmentation: Towards Unified AI-Generated Image and Video Detection

    cs.CV 2026-05 unverdicted novelty 5.0

    VINA trains a single detector on images plus video frames using a cross-modal supervised contrastive objective, yielding bidirectional gains and SOTA results on 14 image, video, and in-the-wild benchmarks.
