SAGA: Source Attribution of Generative AI Videos

Amit K. Roy-Chowdhury; Athula Balachandran; Hao Xiong; Rohit Kundu; Shan Jia; Vishal Mohanty

arxiv: 2511.12834 · v2 · submitted 2025-11-16 · 💻 cs.CV · cs.AI

Rohit Kundu, Vishal Mohanty, Hao Xiong, Shan Jia, Athula Balachandran, Amit K. Roy-Chowdhury

Pith reviewed 2026-05-17 21:28 UTC · model grok-4.3

keywords: source attribution · generative AI videos · video forensics · temporal attention signatures · data-efficient learning · synthetic video provenance · multi-granular attribution

The pith

SAGA attributes generative AI videos to their exact source model using only 0.5 percent labeled data per class.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SAGA to move beyond binary real-versus-fake detection by identifying the precise generative model that produced a synthetic video. It does so at five levels of granularity: whether the video is authentic, which task created it, which model version, which team developed it, and which generator was used. The method rests on a video transformer that pulls spatio-temporal artifacts from a robust vision foundation model and a pretrain-and-attribute strategy that reaches state-of-the-art accuracy with only 0.5 percent of the usual labeled data per class. It also supplies Temporal Attention Signatures that visualize the temporal patterns distinguishing one generator from another. A reader would care because hyper-realistic synthetic videos already outpace simple detectors and now require traceable provenance for forensic and regulatory use.
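
The five granularity levels can be pictured as a lookup from one source label to its coarser labels. The following is a minimal sketch, assuming a hand-written metadata table; the generator names, teams, versions, and the `label_at` helper are hypothetical placeholders for illustration and are not taken from the paper's dataset.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SourceLabel:
    authentic: bool            # level 1: real vs. AI-generated
    task: Optional[str]        # level 2: generation task, e.g. "T2V" or "I2V"
    version: Optional[str]     # level 3: model version
    team: Optional[str]        # level 4: development team
    generator: Optional[str]   # level 5: exact generator


# Hypothetical table: every name here is a placeholder.
GENERATORS = {
    "gen_a_v1": SourceLabel(False, "T2V", "v1", "team_x", "gen_a_v1"),
    "gen_a_v2": SourceLabel(False, "T2V", "v2", "team_x", "gen_a_v2"),
    "gen_b": SourceLabel(False, "I2V", "v1", "team_y", "gen_b"),
}
REAL = SourceLabel(True, None, None, None, None)


def label_at(source: Optional[str], level: str):
    """Return the label of `source` at one level; a source of None means a real video."""
    entry = REAL if source is None else GENERATORS[source]
    return getattr(entry, level)
```

With a table like this, one generator annotation yields a training target at each of the five levels, for example `label_at("gen_a_v2", "team")` gives `"team_x"`.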

Core claim

SAGA is the first framework for multi-granular source attribution of generative AI videos across authenticity, generation task such as text-to-video or image-to-video, model version, development team, and the exact generator. Its video transformer architecture extracts distinguishing spatio-temporal artifacts from features of a robust vision foundation model, while a data-efficient pretrain-and-attribute strategy achieves state-of-the-art performance with only 0.5 percent source-labeled data per class and matches fully supervised results. Temporal Attention Signatures provide the first visual explanation of why different video generators remain distinguishable by highlighting learned timing,

What carries the argument

The data-efficient pretrain-and-attribute strategy combined with Temporal Attention Signatures inside a video transformer that processes features from a robust vision foundation model to isolate stable spatio-temporal artifacts.
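
The Figure 2 caption describes per-frame features from a frozen encoder being stacked in temporal order, with positional encoding added before the video transformer. The sketch below illustrates only that input-building step in plain Python. The sinusoidal form of the encoding and the function names are assumptions made for illustration; this page does not say which positional encoding the paper uses.

```python
import math


def sinusoidal_position(t, dim):
    """Sinusoidal encoding for frame index t (an assumed choice of encoding)."""
    return [
        math.sin(t / 10000 ** (i / dim)) if i % 2 == 0
        else math.cos(t / 10000 ** ((i - 1) / dim))
        for i in range(dim)
    ]


def build_video_sequence(frame_features):
    """Stack per-frame feature vectors in temporal order and add positions.

    `frame_features` is a list of equal-length float lists, one per frame,
    standing in for the outputs of a frozen image encoder.
    """
    dim = len(frame_features[0])
    return [
        [f + p for f, p in zip(frame, sinusoidal_position(t, dim))]
        for t, frame in enumerate(frame_features)
    ]
```

The resulting list of position-aware frame vectors is what a temporal transformer would consume; the transformer itself is left out of this sketch.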

Load-bearing premise

Spatio-temporal artifacts extracted from a robust vision foundation model stay unique, stable, and transferable enough across generators and domains to support accurate attribution even when labeled data is reduced to 0.5 percent per class.
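
To make the 0.5 percent figure concrete, the sketch below draws a per-class labeled subset from a list of class labels. It assumes uniform random sampling within each class and a floor of one sample per class; both choices, and the function name, are illustrative assumptions rather than the paper's documented procedure.

```python
import math
import random
from collections import defaultdict


def sample_labeled_subset(labels, fraction=0.005, seed=0):
    """Pick `fraction` of the indices from each class, at least one per class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    chosen = []
    for label in sorted(by_class):
        indices = by_class[label]
        k = max(1, math.floor(len(indices) * fraction))
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)
```

Under these assumptions a class with 1,000 videos contributes 5 labeled examples and a class with 2,000 contributes 10, which shows how little source-level supervision the claim refers to.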

What would settle it

Apply SAGA to videos produced by a new generator unseen during training and measure whether attribution accuracy drops well below the fully supervised baseline.
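
One way to run that check is to hold out a generator and compare accuracy on seen and unseen sources. The sketch below assumes predictions are scored at a level coarser than the generator itself (for example team or task), since an unseen generator has no class of its own at the finest level. The record layout and function names are hypothetical.

```python
def accuracy(pairs):
    """Fraction of (predicted, true) pairs that agree."""
    if not pairs:
        raise ValueError("no samples to score")
    return sum(p == t for p, t in pairs) / len(pairs)


def held_out_gap(records, held_out):
    """Accuracy on seen generators minus accuracy on the held-out generator.

    Each record is (generator, predicted_label, true_label), with labels taken
    at a coarse attribution level such as team or task.
    """
    seen = [(p, t) for g, p, t in records if g != held_out]
    unseen = [(p, t) for g, p, t in records if g == held_out]
    return accuracy(seen) - accuracy(unseen)
```

A gap near zero would be consistent with transferable artifacts, while a large positive gap would point the other way.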

Figures

Figures reproduced from arXiv: 2511.12834 by Amit K. Roy-Chowdhury, Athula Balachandran, Hao Xiong, Rohit Kundu, Shan Jia, Vishal Mohanty.

**Figure 1.** SAGA: Data-Efficient & Interpretable AI Video Source Attribution. (a) Temporal Attention Signatures (T-Sigs): SAGA pioneers AI video source attribution. Our novel T-Sigs provide interpretability, showing unique fingerprints for Real, Seen, and even Unseen generators. (b) Feature Separability: t-SNE visualization of learned features demonstrates clear generator clusters. (c) Multi-Granular Performance & Dat…

**Figure 2.** Overall framework of SAGA with a two-stage training approach. In Stage-1, each video xk with real/fake labels is processed through a frozen foundational vision encoder to extract image-level features zm, which are stacked in temporal order to form the video representation ζk. Positional encoding is added, and the sequence is passed through our video transformer architecture θ (Sec. 3.1) to obtain ϕk. The c…

**Figure 3.** HNM enables better separation boundaries between classes while semiHNM will exclude these samples from the loss. This focuses the model on the most challenging negatives within the batch. In our source attribution task, some generators produced embeddings with overlapping t-SNE clusters when trained with CE-loss alone (…

**Figure 4.** t-SNE visualization of SAGA’s learned representations trained on the TASK-L, BIN-L, SD-L and TEAM-L attribution tasks, respectively. Even when supervised at coarser levels, SAGA distinctly clusters individual generators, revealing strong fine-grained discriminative ability.

**Figure 5.** t-SNE visualization of SAGA on the GEN-L attribution task with different loss functions.

**Figure 6.** T-Sigs for classes in the different attribution levels. level. This level of separation indicates that the model is sensitive to subtle distributional differences introduced by specific generator architectures or research teams, enabling it to infer whether an unknown generator shares an SD backbone or team affiliation, or represents a completely novel source. The t-SNE analysis for the GEN-L attribution …
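
The Figure 3 caption says HNM focuses the model on the most challenging negatives within the batch. One common way to realize that idea is a batch-hard triplet loss, sketched below in plain Python. The triplet form, the Euclidean distance, and the margin value are assumptions for illustration; this is a stand-in and not the paper's actual objective, which this page does not spell out.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def batch_hard_triplet_loss(embeddings, labels, margin=0.2):
    """Mean triplet loss using, for each anchor, the farthest positive and
    the closest negative in the batch (the hardest negative)."""
    losses = []
    for i, (anchor, label) in enumerate(zip(embeddings, labels)):
        pos = [euclidean(anchor, e)
               for j, (e, other) in enumerate(zip(embeddings, labels))
               if other == label and j != i]
        neg = [euclidean(anchor, e)
               for e, other in zip(embeddings, labels) if other != label]
        if not pos or not neg:
            continue  # this anchor has no valid triplet in the batch
        losses.append(max(0.0, max(pos) - min(neg) + margin))
    return sum(losses) / len(losses) if losses else 0.0
```

When classes are already separated by more than the margin the loss is zero; overlapping clusters, like those the caption attributes to CE-loss alone, produce a positive loss that pushes the closest wrong-class embeddings apart.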

Original abstract

The proliferation of generative AI has led to hyper-realistic synthetic videos, escalating misuse risks and outstripping binary real/fake detectors. We introduce SAGA (Source Attribution of Generative AI videos), the first comprehensive framework to address the urgent need for AI-generated video source attribution at a large scale. Unlike traditional detection, SAGA identifies the specific generative model used. It uniquely provides multi-granular attribution across five levels: authenticity, generation task (e.g., T2V/I2V), model version, development team, and the precise generator, offering far richer forensic insights. Our novel video transformer architecture, leveraging features from a robust vision foundation model, effectively captures spatio-temporal artifacts. Critically, we introduce a data-efficient pretrain-and-attribute strategy, enabling SAGA to achieve state-of-the-art attribution using only 0.5% of source-labeled data per class, matching fully supervised performance. Furthermore, we propose Temporal Attention Signatures (T-Sigs), a novel interpretability method that visualizes learned temporal differences, offering the first explanation for why different video generators are distinguishable. Extensive experiments on public datasets, including cross-domain scenarios, demonstrate that SAGA sets a new benchmark for synthetic video provenance, providing crucial, interpretable insights for forensic and regulatory applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SAGA introduces multi-granular attribution for generative videos plus a low-data strategy and T-Sigs, but the results need tighter checks against content and prompt confounds to support the cross-domain claims.

Hi,

The punchline is that SAGA gives us a way to attribute generative videos to their source models at multiple levels of granularity, with a data-efficient method that reportedly matches full supervision using just 0.5 percent of the labels, plus a new interpretability technique.

What the paper does is introduce this five-level framework covering authenticity, task type, model version, team, and specific generator. It uses a video transformer on features from a vision foundation model to pick up spatio-temporal artifacts. The pretrain-and-attribute strategy is the key to the low-data performance, and T-Sigs are meant to show why generators differ in their temporal patterns.

This is a solid step toward more useful forensic tools. Binary detection is no longer enough as generators get better, and having richer attribution helps with tracking misuse and regulation. The focus on interpretability is a plus for practical adoption.

The soft spots are around the strength of the evidence for those performance claims. The abstract is light on specifics like exact baselines, error bars, or detailed ablations, so it's hard to see how much the results depend on the particular datasets or if they generalize. The concern about whether the extracted artifacts are truly generator-specific or could be picking up content biases or prompt distributions is worth taking seriously. If the cross-domain experiments don't include controls for video length, resolution, or semantic content, the multi-granular attribution could be less reliable than it appears. The paper would be stronger with explicit tests ruling that out. Overall the approach is empirical and seems to engage honestly with the literature on vision models and attribution.

This work is for computer vision researchers and practitioners in digital forensics. Someone working on synthetic media detection or platform tools would get value from the ideas, even if they end up adapting the method. It deserves a serious referee because the problem is timely and the proposed framework is concrete enough to evaluate and improve. I would recommend sending it to peer review with requests for more detailed results and confound checks.

Best regards,
A colleague

Referee Report

2 major / 1 minor

Summary. The manuscript introduces SAGA, a framework for multi-granular source attribution of generative AI videos across five levels (authenticity, generation task such as T2V/I2V, model version, development team, and precise generator). It proposes a video transformer architecture leveraging features from a robust vision foundation model to capture spatio-temporal artifacts, combined with a data-efficient pretrain-and-attribute strategy. The central claims are that this achieves state-of-the-art attribution performance using only 0.5% of source-labeled data per class while matching fully supervised results, and that the novel Temporal Attention Signatures (T-Sigs) provide the first explanation for generator distinguishability. Experiments on public datasets including cross-domain scenarios are said to support these results.

Significance. If the performance and interpretability claims hold after addressing controls for confounds, this would advance AI-generated video forensics beyond binary detection by enabling precise provenance tracking with minimal supervision and offering visual explanations of model-specific artifacts. Such capabilities could support regulatory and forensic applications in a domain where generative video misuse is growing rapidly.

major comments (2)

[Abstract] Abstract: The claim that SAGA achieves state-of-the-art attribution matching fully supervised performance with only 0.5% source-labeled data per class supplies no quantitative details on baselines, error bars, data splits, or ablation studies. This information is load-bearing for evaluating whether the empirical results support the central data-efficiency claim.
[Abstract] Abstract (cross-domain scenarios): The reported cross-domain results do not include explicit controls to demonstrate that the spatio-temporal artifacts extracted from the vision foundation model are dominated by stable, generator-specific temporal signatures rather than content statistics, prompt distributions, or video length/resolution cues. Without such controls, the multi-granular attribution performance (including version/team-level) could be undermined by distribution shift, directly affecting the weakest assumption underlying both the pretrain-and-attribute pipeline and T-Sigs.
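
A simple way to probe the confound raised in the second major comment is to ask how well the generator can be predicted from clip metadata alone. The sketch below is a majority-vote baseline over metadata buckets; the tuple layout `(num_frames, width, height)` and the function name are hypothetical, and the check is an illustration of the kind of control the referee asks for, not something reported by the paper.

```python
from collections import Counter, defaultdict


def metadata_only_accuracy(train, test):
    """Accuracy of predicting the generator from metadata alone.

    `train` and `test` are lists of (metadata, generator) pairs, where
    metadata is any hashable value such as (num_frames, width, height).
    """
    votes = defaultdict(Counter)
    overall = Counter()
    for meta, gen in train:
        votes[meta][gen] += 1
        overall[gen] += 1
    fallback = overall.most_common(1)[0][0]  # most frequent generator overall
    correct = 0
    for meta, gen in test:
        guess = votes[meta].most_common(1)[0][0] if meta in votes else fallback
        correct += guess == gen
    return correct / len(test)
```

If this baseline approaches the accuracy of the full model, length and resolution cues could explain much of the attribution result, and the metadata would need to be equalized across generators before drawing conclusions.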

minor comments (1)

The five attribution granularity levels are listed in the abstract but would benefit from an early table or diagram defining each level with examples to improve clarity for readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments, which help us improve the clarity and robustness of our work. We address the major comments point by point below, proposing revisions to the manuscript where necessary.

Referee: [Abstract] Abstract: The claim that SAGA achieves state-of-the-art attribution matching fully supervised performance with only 0.5% source-labeled data per class supplies no quantitative details on baselines, error bars, data splits, or ablation studies. This information is load-bearing for evaluating whether the empirical results support the central data-efficiency claim.

Authors: We agree that incorporating quantitative details into the abstract would strengthen the presentation of our central claim. In the revised manuscript, we will update the abstract to include specific metrics, such as the top-1 attribution accuracy with 0.5% labeled data compared to fully supervised baselines, mention the use of standard data splits, and note that error bars and ablation studies are detailed in the experimental sections. This revision will provide the necessary context without exceeding abstract length constraints.

revision: yes

Referee: [Abstract] Abstract (cross-domain scenarios): The reported cross-domain results do not include explicit controls to demonstrate that the spatio-temporal artifacts extracted from the vision foundation model are dominated by stable, generator-specific temporal signatures rather than content statistics, prompt distributions, or video length/resolution cues. Without such controls, the multi-granular attribution performance (including version/team-level) could be undermined by distribution shift, directly affecting the weakest assumption underlying both the pretrain-and-attribute pipeline and T-Sigs.

Authors: We appreciate this important point on potential confounds. Our experiments across public datasets already incorporate variations in content, prompts, and video properties to test generalization. The Temporal Attention Signatures (T-Sigs) are introduced precisely to highlight generator-specific temporal patterns independent of content. To directly address the referee's concern, we will add explicit control experiments in the revision, such as evaluations on content-matched video pairs or ablations removing temporal components, to confirm that the attribution relies on stable generator signatures rather than spurious cues.

revision: yes
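
One possible form of an ablation that removes temporal components is to shuffle the frame order and measure how much accuracy drops. The sketch below assumes this shuffling variant; `score_fn` is a hypothetical callable standing in for a trained attribution model, and nothing here is taken from the authors' planned experiments.

```python
import random


def shuffle_frames(frame_features, seed=0):
    """Return a copy of the frame sequence in random temporal order."""
    shuffled = list(frame_features)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def temporal_reliance(score_fn, videos, labels, seed=0):
    """Accuracy with ordered frames minus accuracy with shuffled frames.

    `score_fn` maps one frame sequence to a predicted label.
    """
    def acc(batch):
        return sum(score_fn(v) == y for v, y in zip(batch, labels)) / len(labels)

    shuffled = [shuffle_frames(v, seed + i) for i, v in enumerate(videos)]
    return acc(videos) - acc(shuffled)
```

A value near zero suggests the model ignores frame order and relies on per-frame cues, while a clearly positive value suggests that temporal structure carries part of the attribution signal.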

Circularity Check

0 steps flagged

No circularity: empirical architecture and experimental validation

The paper introduces an empirical video transformer architecture that extracts spatio-temporal features from a vision foundation model, combined with a pretrain-and-attribute training strategy. Central performance claims (SOTA attribution at 0.5% labeled data per class, multi-granular results, and cross-domain generalization) are presented as outcomes of extensive experiments on public datasets rather than as quantities derived by construction from the paper's own equations or definitions. Temporal Attention Signatures are proposed as a post-hoc interpretability visualization of learned temporal differences, with no indication that they reduce to fitted parameters or self-referential inputs. No load-bearing self-citations, uniqueness theorems, or ansatzes smuggled via prior author work are invoked to force the results; the derivation chain remains self-contained through empirical evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so explicit free parameters, axioms, and invented entities cannot be audited in detail. The framework introduces methodological innovations (T-Sigs, pretrain-and-attribute) rather than new physical entities; standard deep-learning assumptions about feature uniqueness are implicit but not enumerated.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: novel video transformer architecture... Temporal Attention Signatures (T-Sigs)... data-efficient pretrain-and-attribute strategy... Hard Negative Mining (HNM) objective

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: T-Sigs... unique fingerprints for Real, Seen, and even Unseen generators

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Who Generated This 3D Asset? Learning Source Attribution for Generative 3D Models
cs.CV 2026-05 unverdicted novelty 7.0

Introduces the first passive source attribution benchmark for 22 generative 3D models and a Transformer achieving 97.22% accuracy under full supervision and 77.17% with 1% training data.

Video as Natural Augmentation: Towards Unified AI-Generated Image and Video Detection
cs.CV 2026-05 unverdicted novelty 5.0

VINA trains a single detector on images plus video frames using a cross-modal supervised contrastive objective, yielding bidirectional gains and SOTA results on 14 image, video, and in-the-wild benchmarks.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · cited by 2 Pith papers · 5 internal anchors

[1]

Getting vit in shape: Scaling laws for compute- optimal model design.Advances in Neural Information Process- ing Systems, 36, 2024

Ibrahim M Alabdulmohsin, Xiaohua Zhai, Alexander Kolesnikov, and Lucas Beyer. Getting vit in shape: Scaling laws for compute- optimal model design.Advances in Neural Information Process- ing Systems, 36, 2024. 2

work page 2024
[2]

Deepfake media forensics: State of the art and challenges ahead.arXiv preprint arXiv:2408.00388, 2024

Irene Amerini, Mauro Barni, Sebastiano Battiato, Paolo Bestagini, Giulia Boato, Tania Sari Bonaventura, Vittoria Bruni, Roberto Caldelli, Francesco De Natale, Rocco De Nicola, et al. Deepfake media forensics: State of the art and challenges ahead.arXiv preprint arXiv:2408.00388, 2024. 1

work page arXiv 2024
[3]

Vivit: A video vision trans- former

Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Luˇci´c, and Cordelia Schmid. Vivit: A video vision trans- former. InProceedings of the IEEE/CVF international conference on computer vision, pages 6836–6846, 2021. 6

work page 2021
[4]

Ai-generated content: authorship and inventorship in the age of artificial in- telligence

Rosa Maria Ballardini, Kan He, and Teemu Roos. Ai-generated content: authorship and inventorship in the age of artificial in- telligence. InOnline Distribution of Content in the EU, pages 117–135. Edward Elgar Publishing, 2019. 1

work page 2019
[5]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Y am Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023. 1, 6

work page internal anchor Pith review Pith/arXiv arXiv 2023
[6]

Featuretransfer: Unsupervised domain adaptation for cross-domain deepfake detection.Security and Communication Networks, 2021(1):9942754, 2021

Baoying Chen and Shunquan Tan. Featuretransfer: Unsupervised domain adaptation for cross-domain deepfake detection.Security and Communication Networks, 2021(1):9942754, 2021. 2, 4

work page 2021
[7]

Demamba: Ai-generated video detection on million-scale genvideo benchmark, 2024

Haoxing Chen, Y an Hong, Zizheng Huang, Zhuoer Xu, Zhangx- uan Gu, Y aohui Li, Jun Lan, Huijia Zhu, Jianfu Zhang, Weiqiang Wang, et al. Demamba: Ai-generated video detection on million- scale genvideo benchmark.arXiv preprint arXiv:2405.19707,

work page arXiv
[8]

Finger: Content aware fine-grained evaluation with reasoning for ai-generated videos.arXiv preprint arXiv:2504.10358, 2025

Rui Chen, Lei Sun, Jing Tang, Geng Li, and Xiangxiang Chu. Finger: Content aware fine-grained evaluation with reasoning for ai-generated videos.arXiv preprint arXiv:2504.10358, 2025. 2, 4

work page arXiv 2025
[9]

Seine: Short-to-long video diffusion model for generative transition and prediction

Xinyuan Chen, Y aohui Wang, Lingjun Zhang, Shaobin Zhuang, Xin Ma, Jiashuo Y u, Y ali Wang, Dahua Lin, Y u Qiao, and Ziwei Liu. Seine: Short-to-long video diffusion model for generative transition and prediction. InThe T welfth International Conference on Learning Representations, 2023. 1

work page 2023
[10]

Can we leave deepfake data behind in training deepfake detector?arXiv preprint arXiv:2408.17052,

Jikang Cheng, Zhiyuan Y an, Ying Zhang, Y uhao Luo, Zhongyuan Wang, and Chen Li. Can we leave deepfake data behind in training deepfake detector?arXiv preprint arXiv:2408.17052,

work page arXiv
[11]

Intriguing properties of synthetic images: from generative adversarial networks to diffusion models

Riccardo Corvi, Davide Cozzolino, Giovanni Poggi, Koki Nagano, and Luisa V erdoliva. Intriguing properties of synthetic images: from generative adversarial networks to diffusion models. InProceedings of the IEEE/CVF Conference on Computer V ision and P attern Recognition, pages 973–982, 2023. 2

work page 2023
[12]

On the detection of synthetic images generated by diffusion models

Riccardo Corvi, Davide Cozzolino, Giada Zingarini, Giovanni Poggi, Koki Nagano, and Luisa V erdoliva. On the detection of synthetic images generated by diffusion models. InICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023. 2

work page 2023
[13]

Raising the bar of ai-generated image detection with clip

Davide Cozzolino, Giovanni Poggi, Riccardo Corvi, Matthias Nießner, and Luisa V erdoliva. Raising the bar of ai-generated image detection with clip. InProceedings of the IEEE/CVF Conference on Computer V ision and P attern Recognition, pages 4356–4366, 2024. 6

work page 2024
[14]

Open set synthetic image source attribution.arXiv preprint arXiv:2308.11557, 2023

Shengbang Fang, Tai D Nguyen, and Matthew C Stamm. Open set synthetic image source attribution.arXiv preprint arXiv:2308.11557, 2023. 2

work page arXiv 2023
[15]

Towards discovery and attribution of open-world gan generated images

Sharath Girish, Saksham Suri, Sai Saketh Rambhatla, and Abhi- nav Shrivastava. Towards discovery and attribution of open-world gan generated images. InProceedings of the IEEE/CVF interna- tional conference on computer vision, pages 14094–14103, 2021. 3

work page 2021
[16]

Spatiotemporal inconsistency learning for deepfake video detection

Zhihao Gu, Y ang Chen, Taiping Y ao, Shouhong Ding, Jilin Li, Feiyue Huang, and Lizhuang Ma. Spatiotemporal inconsistency learning for deepfake video detection. InProceedings of the 29th ACM international conference on multimedia, pages 3473–3481,

work page
[17]

Hierarchical fine-grained image forgery detection and localization

Xiao Guo, Xiaohong Liu, Zhiyuan Ren, Steven Grosz, Iacopo Masi, and Xiaoming Liu. Hierarchical fine-grained image forgery detection and localization. InProceedings of the IEEE/CVF Conference on Computer V ision and P attern Recognition, pages 3155–3165, 2023. 6

work page 2023
[18]

Smart mining for deep metric learning

Ben Harwood, Vijay Kumar BG, Gustavo Carneiro, Ian Reid, and Tom Drummond. Smart mining for deep metric learning. In Proceedings of the IEEE international conference on computer vision, pages 2821–2829, 2017. 4

work page 2017
[19]

Gaussian Error Linear Units (GELUs)

Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus).arXiv preprint arXiv:1606.08415, 2016. 4

work page internal anchor Pith review Pith/arXiv arXiv 2016
[20]

A style-based genera- tor architecture for generative adversarial networks

Tero Karras, Samuli Laine, and Timo Aila. A style-based genera- tor architecture for generative adversarial networks. InProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4401–4410, 2019. 2

work page 2019
[21]

Analyzing and improving the image quality of stylegan

Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8110–8119,

work page
[22]

Rohit Kundu, Hao Xiong, Vishal Mohanty, Athula Balachandran, and Amit K Roy-Chowdhury. Towards a universal synthetic video detector: From face or background manipulations to fully ai-generated content.Proceedings of the IEEE/CVF Conference on Computer V ision and P attern Recognition, 2025. 1, 3

work page 2025
[23]

Pika art.https://pika.art/, 2022

Pika Labs. Pika art.https://pika.art/, 2022. 6

work page 2022
[24]

The tug-of-war be- tween deepfake generation and detection.arXiv preprint arXiv:2407.06174, 2024

Hannah Lee, Changyeon Lee, Kevin Farhat, Lin Qiu, Steve Geluso, Aerin Kim, and Oren Etzioni. The tug-of-war be- tween deepfake generation and detection.arXiv preprint arXiv:2407.06174, 2024. 1

work page arXiv 2024
[25]

Fakebench: Probing ex- plainable fake image detection via large multimodal models.arXiv preprint arXiv:2404.13306, 2024

Yixuan Li, Xuelin Liu, Xiaoyang Wang, Bu Sung Lee, Shiqi Wang, Anderson Rocha, and Weisi Lin. Fakebench: Probing explainable fake image detection via large multimodal models. arXiv preprint arXiv:2404.13306, 2024. 1

work page arXiv 2024
[26]

OmniHuman-1: Rethinking the scaling-up of one-stage conditioned human animation models

Gaojie Lin, Jianwen Jiang, Jiaqi Y ang, Zerong Zheng, and Chao Liang. Omnihuman-1: Rethinking the scaling-up of one- stage conditioned human animation models.arXiv preprint arXiv:2502.01061, 2025. 1

work page arXiv 2025
[27]

Ts2-net: Token shift and selection transformer for text-video retrieval

Y uqi Liu, Pengfei Xiong, Luhui Xu, Shengming Cao, and Qin Jin. Ts2-net: Token shift and selection transformer for text-video retrieval. InEuropean conference on computer vision, pages 319–335. Springer, 2022. 6 9

work page 2022
[28]

Domainforensics: Exposing face forgery across domains via bi-directional adaptation.IEEE Trans- actions on Information F orensics and Security, 2024

Qingxuan Lv, Y uezun Li, Junyu Dong, Sheng Chen, Hui Y u, Huiyu Zhou, and Shu Zhang. Domainforensics: Exposing face forgery across domains via bi-directional adaptation.IEEE Trans- actions on Information F orensics and Security, 2024. 2, 4

work page 2024
[29]

Hotshot- xl

John Mullan, Duncan Crawbuck, and Aakash Sastry. Hotshot- xl. https://github.com/hotshotco/hotshot-xl,

work page
[30]

Towards universal fake image detectors that generalize across generative models

Utkarsh Ojha, Y uheng Li, and Y ong Jae Lee. Towards universal fake image detectors that generalize across generative models. In Proceedings of the IEEE/CVF Conference on Computer V ision and P attern Recognition, pages 24480–24489, 2023. 3, 6

work page 2023
[31]

Sora by openai.https://openai.com/sora/,

OpenAI. Sora by openai.https://openai.com/sora/,

work page
[32]

Xai-based detection of adversarial attacks on deepfake detectors.arXiv preprint arXiv:2403.02955, 2024

Ben Pinhasov, Raz Lapid, Rony Ohayon, Moshe Sipper, and Y ehudit Aperstein. Xai-based detection of adversarial attacks on deepfake detectors.arXiv preprint arXiv:2403.02955, 2024. 2

work page arXiv 2024
[33]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M¨uller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution im- age synthesis.arXiv preprint arXiv:2307.01952, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[34]

Thinking in frequency: Face forgery detection by mining frequency-aware clues

Y uyang Qian, Guojun Yin, Lu Sheng, Zixuan Chen, and Jing Shao. Thinking in frequency: Face forgery detection by mining frequency-aware clues. InEuropean conference on computer vision, pages 86–103. Springer, 2020. 6

work page 2020
[35]

Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents.arXiv preprint arXiv:2204.06125, 1(2):3, 2022. 2

work page internal anchor Pith review Pith/arXiv arXiv 2022
[36]

Ai-generated video purports to show apocalyptic scenes of los angeles wildfires, 2025

Reuters. Ai-generated video purports to show apocalyptic scenes of los angeles wildfires, 2025. 1

work page 2025
[37]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj¨orn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF confer- ence on computer vision and pattern recognition, pages 10684– 10695, 2022. 2

work page 2022
[38]

Photore- alistic text-to-image diffusion models with deep language under- standing.Advances in neural information processing systems, 35: 36479–36494, 2022

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gon- tijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photore- alistic text-to-image diffusion models with deep language under- standing.Advances in neural information processing systems, 35: 36479–36494, 2022. 2

work page 2022
[39]

Facenet: A unified embedding for face recognition and clustering

Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 815–823, 2015. 4

work page 2015
[40]

De-fake: Detection and attribution of fake images generated by text-to- image generation models

Zeyang Sha, Zheng Li, Ning Y u, and Y ang Zhang. De-fake: Detection and attribution of fake images generated by text-to- image generation models. InProceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, pages 3418–3432, 2023. 6

work page 2023
[41]

Generative ai and intellectual property rights

Jan Smits and Tijn Borghuis. Generative ai and intellectual property rights. InLaw and artificial intelligence: Regulating AI and applying AI in legal practice, pages 323–344. Springer, 2022. 1

work page 2022
[42] Xiufeng Song, Xiao Guo, Jiache Zhang, Qirui Li, Lei Bai, Xiaoming Liu, Guangtao Zhai, and Xiaohong Liu. On learning multi-modal forgery representation for diffusion generated video detection. The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.
[43] Morph Studio. Morph studio. https://www.morphstudio.com/, 2024.
[44] Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, Ping Liu, and Yunchao Wei. Rethinking the up-sampling operations in cnn-based generative network for generalizable deepfake detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 28130–28139, 2024.
[45] Chuangchuang Tan, Renshuai Tao, Huan Liu, Guanghua Gu, Baoyuan Wu, Yao Zhao, and Yunchao Wei. C2p-clip: Injecting category common prompt in clip to enhance generalization in deepfake detection. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 7184–7192, 2025.
[46] Danial Samadi Vahdati, Tai D Nguyen, Aref Azizpour, and Matthew C Stamm. Beyond deepfake images: Detecting ai-generated videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4397–4408,
[47] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(11), 2008.
[48] A Vaswani, N Shazeer, N Parmar, J Uszkoreit, L Jones, A Gomez, Ł Kaiser, and I Polosukhin. Attention is all you need. NeurIPS,
[49] Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023.
[50] Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A Efros. Cnn-generated images are surprisingly easy to spot... for now. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8695–8704,
[51] Zhendong Wang, Jianmin Bao, Wengang Zhou, Weilun Wang, Hezhen Hu, Hong Chen, and Houqiang Li. Dire for diffusion-generated image detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22445–22455,
[52] Zhenting Wang, Chen Chen, Yi Zeng, Lingjuan Lyu, and Shiqing Ma. Where did i come from? origin attribution of ai-generated images. Advances in neural information processing systems, 36:74478–74500, 2023.
[53] Jinbo Xing, Menghan Xia, Yong Zhang, Haoxin Chen, Wangbo Yu, Hanyuan Liu, Gongye Liu, Xintao Wang, Ying Shan, and Tien-Tsin Wong. Dynamicrafter: Animating open-domain images with video diffusion priors. In European Conference on Computer Vision, pages 399–417. Springer, 2024.
[54] Yuting Xu, Jian Liang, Gengyun Jia, Ziming Yang, Yanhao Zhang, and Ran He. Tall: Thumbnail layout for deepfake video detection. In Proceedings of the IEEE/CVF international conference on computer vision, pages 22658–22668, 2023.
[55] Hong Xuan, Abby Stylianou, and Robert Pless. Improved embeddings with easy positive triplet mining. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2474–2482, 2020.
[56] Tianyun Yang, Ziyao Huang, Juan Cao, Lei Li, and Xirong Li. Deepfake network architecture attribution. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4662–4670,
[57] Tianyun Yang, Danding Wang, Fan Tang, Xinying Zhao, Juan Cao, and Sheng Tang. Progressive open space expansion for open-set model attribution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15856–15865, 2023.
[58] David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, Yuchao Gu, Difei Gao, and Mike Zheng Shou. Show-1: Marrying pixel and latent diffusion models for text-to-video generation. International Journal of Computer Vision, pages 1–15, 2024.
[59] Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-sora: Democratizing efficient video production for all, 2024.
[60] Shuai Zhou, Chi Liu, Dayong Ye, Tianqing Zhu, Wanlei Zhou, and Philip S Yu. Adversarial attacks and defenses in deep learning: From a perspective of cybersecurity. ACM Computing Surveys, 55(8):1–39, 2022.
[61] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 2223–2232, 2017.
