ReConFuse: Reconstruction-Error Guided Semantic Fusion for AI-Generated Video Detection

arxiv: 2606.04706 · v1 · pith:7D52HLKEnew · submitted 2026-06-03 · 💻 cs.CV

Xiaojing Chen (1), Xinyu Lu (1), Changtao Miao (2), Yunfeng Diao (3) ((1) Anhui University, (2) Ant Group, (3) Hefei University of Technology)

Pith reviewed 2026-06-28 07:11 UTC · model grok-4.3

classification 💻 cs.CV

keywords AI-generated video detection · reconstruction error · semantic fusion · Mamba temporal modeling · video forensics · generalization · VAE · forensic cue

The pith

Reconstruction errors from a pretrained VAE, fused with semantic features and temporal modeling, detect AI-generated videos with strong generalization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that real and generated videos produce distinguishable frame-wise reconstruction error patterns when passed through a fixed pretrained WF-VAE. These patterns act as a forensic signal that reveals distributional differences not easily erased by evolving generators. The proposed ReConFuse method aligns the error cues with multi-frame semantic features and applies a Mamba-based module to capture their temporal evolution for video-level decisions. This matters because reliable detection is needed to address misinformation risks from increasingly realistic synthetic videos. If the approach holds, it supplies a practical cue that improves performance across generators without retraining the underlying VAE.

Core claim

By reconstructing input videos with a pretrained WF-VAE, real and generated videos exhibit distinguishable frame-wise reconstruction error patterns. ReConFuse extracts reconstruction error cues from WF-VAE reconstructed videos, aligns them with multi-frame semantic features, and uses a Mamba-based module to model temporal evolution for video-level classification.
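As a rough illustration of what a frame-wise reconstruction error is, the sketch below scores each frame by the mean absolute difference between the frame and its reconstruction. It is a minimal sketch under stated assumptions, not the paper's implementation: `frame_errors` and `toy_reconstruct` are hypothetical names, the stand-in reconstructor simply clamps pixel values and has nothing to do with WF-VAE, and the paper works with dense error maps rather than one scalar per frame.

```python
from typing import Callable, List, Sequence

Frame = List[List[float]]  # one grayscale frame: rows of pixel intensities


def frame_errors(
    video: Sequence[Frame],
    reconstruct: Callable[[Sequence[Frame]], Sequence[Frame]],
) -> List[float]:
    """Mean absolute reconstruction error of each frame, in frame order."""
    rebuilt_video = reconstruct(video)
    errors = []
    for original, rebuilt in zip(video, rebuilt_video):
        diffs = [
            abs(p - q)
            for row_o, row_r in zip(original, rebuilt)
            for p, q in zip(row_o, row_r)
        ]
        errors.append(sum(diffs) / len(diffs))
    return errors


def toy_reconstruct(video: Sequence[Frame]) -> List[Frame]:
    """Hypothetical stand-in for a frozen VAE: clamps pixels to at most 0.5."""
    return [[[min(p, 0.5) for p in row] for row in frame] for frame in video]


video = [
    [[0.0, 1.0], [0.5, 0.5]],  # one bright pixel the stand-in cannot reproduce
    [[0.2, 0.4], [0.1, 0.3]],  # fully reproducible by the stand-in
]
print(frame_errors(video, toy_reconstruct))  # [0.125, 0.0]
```

In a real setting the `reconstruct` callable would wrap the encode/decode pass of the frozen video VAE, and the per-frame sequence of errors is what a temporal model would then consume.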

What carries the argument

ReConFuse framework that extracts reconstruction error cues from a fixed pretrained WF-VAE, aligns those cues with semantic features, and applies Mamba-based temporal modeling for classification.
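The three stages named here (error cues, alignment with semantic features, temporal modeling) can be sketched in a few lines. This is only a toy under loud assumptions: the source gives no equations, so concatenation as the fusion step, a fixed linear recurrence in place of Mamba's selective state-space layer, and the names `fuse`, `linear_scan`, and `video_score` are all hypothetical choices made for illustration.

```python
from typing import List, Sequence


def fuse(
    error_tokens: Sequence[float],
    semantic_tokens: Sequence[Sequence[float]],
    scale: float = 1.0,
) -> List[List[float]]:
    """Append each frame's scaled error cue to that frame's semantic vector."""
    if len(error_tokens) != len(semantic_tokens):
        raise ValueError("need one error token per frame of semantic features")
    return [
        list(sem) + [scale * err]
        for err, sem in zip(error_tokens, semantic_tokens)
    ]


def linear_scan(tokens: Sequence[Sequence[float]], decay: float = 0.5) -> List[float]:
    """Run h_t = decay * h_{t-1} + (1 - decay) * x_t over frames.

    A fixed linear recurrence, used here as a simplified stand-in for a
    state-space temporal module. Returns the final hidden state.
    """
    state = [0.0] * len(tokens[0])
    for tok in tokens:
        state = [decay * h + (1.0 - decay) * x for h, x in zip(state, tok)]
    return state


def video_score(state: Sequence[float], weights: Sequence[float], bias: float = 0.0) -> float:
    """Linear read-out of the final state; the sign convention is arbitrary."""
    return sum(w * s for w, s in zip(weights, state)) + bias


fused = fuse([0.0, 1.0], [[1.0], [1.0]])   # two frames, 1-d semantic features
final_state = linear_scan(fused)           # [0.75, 0.5]
print(video_score(final_state, [0.0, 2.0]))  # 1.0
```

The point of the sketch is the data flow: one fused token per frame, a recurrent pass over frames in order, and a single video-level score at the end.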

Load-bearing premise

The frame-wise reconstruction errors produced by the fixed pretrained WF-VAE reliably expose distributional differences between real and generated videos that remain useful after semantic alignment and temporal modeling.

What would settle it

A new generator set where ReConFuse accuracy matches or falls below a semantic-only baseline that discards all reconstruction-error input.
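The comparison described here reduces to a simple check on a held-out generator set. The sketch below is a hedged illustration only: `accuracy` and `claim_is_undercut` are hypothetical helper names, the labels and predictions are made-up toy values, and a real test would also need confidence intervals across several generators.

```python
from typing import Sequence


def accuracy(preds: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of videos whose predicted label matches the true label."""
    if len(preds) != len(labels) or not labels:
        raise ValueError("preds and labels must be non-empty and equal length")
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def claim_is_undercut(
    full_preds: Sequence[int],
    semantic_only_preds: Sequence[int],
    labels: Sequence[int],
) -> bool:
    """True when the full model matches or falls below the semantic-only baseline."""
    return accuracy(full_preds, labels) <= accuracy(semantic_only_preds, labels)


labels = [1, 0, 1, 0]           # toy ground truth: 1 = generated, 0 = real
full = [1, 0, 1, 1]             # toy predictions with reconstruction-error input
semantic_only = [1, 0, 0, 1]    # toy predictions without it
print(claim_is_undercut(full, semantic_only, labels))  # False
```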

Figures

Figures reproduced from arXiv: 2606.04706 by Xiaojing Chen (1), Xinyu Lu (1), Changtao Miao (2), Yunfeng Diao (3) ((1) Anhui University, (2) Ant Group, (3) Hefei University of Technology).

**Figure 1.** Reconstruction behavior comparison between real and AI-generated videos (generated by Zhipu Qingyan [43]). The original frames, reconstructed frames (reconstructed by WF-VAE), and pseudo-color maps of reconstruction errors (purple denotes low reconstruction error, whereas yellow denotes high reconstruction error) illustrate that reconstruction error can provide discriminative low-level forensic evidence …

**Figure 2.** Overview of the proposed ReConFuse framework. The input video is reconstructed by a pretrained video VAE, reconstruction errors are extracted as low-level forensic evidence, reconstruction error tokens are aligned and fused with semantic tokens, and Mamba is used for temporal modeling and video-level real/fake prediction. This process can be viewed as projecting an input video into the learned video late…

**Figure 3.** Comparison of grayscale reconstruction-error maps produced by different VAE reconstruction priors. This comparison supports the use of a video reconstruction prior for AI-generated video detection. … the generation process. Based on this comparison, we adopt WF-VAE as the reconstruction prior in ReConFuse, since it provides dense frame-wise reconstruction discrepancies while maintaining temporal consistency…

Original abstract

AI-generated videos are becoming increasingly realistic, raising serious concerns about misinformation, content authenticity, and media trust. Reliable AI-generated video detection is therefore essential for multimedia forensics, yet remains challenging due to the need to capture spatial artifacts, temporal dynamics, and generalize to evolving generative models. In this paper, we explore reconstruction error as a discriminative forensic cue for AI-generated video detection. By reconstructing input videos with a pretrained WF-VAE, we observe that real and generated videos exhibit distinguishable frame-wise reconstruction error patterns, suggesting that reconstruction errors can reveal their distributional discrepancies. However, extending reconstruction-based image detection to videos is non-trivial, since video reconstruction errors are temporally organized across frames and require semantic context for effective interpretation. To address these challenges, we propose ReConFuse, a reconstruction-guided semantic fusion framework for video-level AI-generated video detection. ReConFuse extracts reconstruction error cues from WF-VAE reconstructed videos, aligns them with multi-frame semantic features, and uses a Mamba-based module to model temporal evolution for video-level classification. Experiments across multiple generators and evaluation settings demonstrate the effectiveness and strong generalization ability of ReConFuse.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ReConFuse combines WF-VAE reconstruction errors with semantic features and Mamba temporal modeling for video detection, but the abstract supplies no numbers or baselines, so the effectiveness claim cannot be checked from the abstract alone.

The paper's central move is to treat frame-wise reconstruction errors from a fixed pretrained WF-VAE as a forensic signal, align those errors with multi-frame semantic features, and feed the result into a Mamba block for video-level classification. That specific pipeline is what they present as new.

The write-up clearly states why extending image-level reconstruction detectors to video is not straightforward: the errors are temporally ordered and need semantic context to be interpreted. The motivation section on distributional differences between real and generated videos is easy to follow and matches what one would expect from a reconstruction-based approach.

The main limitation is that the abstract asserts effectiveness and strong generalization across generators but gives no quantitative results, no baseline comparisons, no ablation numbers, and no error analysis. Without those, the soundness of the central claim cannot be checked. The key assumption, that the reconstruction errors stay informative after semantic alignment and temporal modeling, remains plausible on paper but unverified in the supplied text.

This work is aimed at researchers already working on reconstruction cues or temporal modeling for deepfake detection. A reader who wants to see how Mamba is applied to error maps might find the architecture description useful once the full experiments are available.

I would send it to peer review so the empirical section can be examined directly rather than desk-rejecting on the basis of the abstract alone.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes ReConFuse, a reconstruction-guided semantic fusion framework for AI-generated video detection. It extracts frame-wise reconstruction error cues using a fixed pretrained WF-VAE, aligns these errors with multi-frame semantic features, and applies a Mamba-based temporal modeling module to produce video-level classifications. The central claim is that experiments across multiple generators and evaluation settings demonstrate the method's effectiveness and strong generalization ability.

Significance. If the empirical claims are substantiated with quantitative evidence, the approach could introduce a useful forensic cue based on reconstruction discrepancies that captures distributional differences between real and generated videos, potentially aiding generalization to unseen generators in multimedia forensics.

major comments (1)

[Abstract] The assertion that 'experiments across multiple generators and evaluation settings demonstrate the effectiveness and strong generalization ability of ReConFuse' is presented without any quantitative results, baselines, ablation studies, dataset descriptions, or error analysis. This is load-bearing for the paper's central empirical contribution, as the soundness of the method cannot be assessed from the provided text.

minor comments (2)

The description of how reconstruction errors are aligned with semantic features and fed into the Mamba module would benefit from explicit equations or pseudocode to clarify the fusion process.
Pretraining details and architecture of the WF-VAE (e.g., whether it remains completely fixed during inference) should be stated more precisely to allow reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and constructive comment. We address the concern regarding the abstract below.

Referee: [Abstract] The assertion that 'experiments across multiple generators and evaluation settings demonstrate the effectiveness and strong generalization ability of ReConFuse' is presented without any quantitative results, baselines, ablation studies, dataset descriptions, or error analysis. This is load-bearing for the paper's central empirical contribution, as the soundness of the method cannot be assessed from the provided text.

Authors: We acknowledge that the abstract, as currently written, summarizes the empirical claims at a high level without including specific quantitative metrics. The full manuscript (Sections 4 and 5) contains the requested details: quantitative results across multiple generators (e.g., accuracy, AUC on held-out generators), comparisons to baselines, ablation studies on the fusion and Mamba components, dataset descriptions, and error analysis. To address the referee's point directly, we will revise the abstract to incorporate key quantitative highlights (e.g., performance gains and generalization metrics) while remaining within length limits. This revision will make the central claim more self-contained.

Revision: yes
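For readers unfamiliar with the AUC metric mentioned in the rebuttal, the sketch below computes it from raw detector scores using the pairwise (Mann-Whitney) definition: the fraction of generated/real pairs in which the generated video receives the higher score, with ties counted as half. The function name `roc_auc` and the toy scores are assumptions for illustration; they are not taken from the paper.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the probability that a positive outscores a negative (ties = 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy scores for one held-out generator: 1 = generated, 0 = real.
scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 0, 1, 0]
print(roc_auc(scores, labels))  # 0.75
```

Reporting this value separately for each generator that was excluded from training is one common way to present generalization results.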

Circularity Check

0 steps flagged

No significant circularity identified

The paper describes a pipeline that extracts frame-wise reconstruction errors from a fixed pretrained WF-VAE, aligns those errors with multi-frame semantic features, and applies a Mamba-based temporal module for video-level classification. No equations, fitting procedures, or derivation steps are presented that would reduce any claimed result to its own inputs by construction. The central claims rest on empirical experiments across generators rather than self-definitional mappings, fitted inputs renamed as predictions, or load-bearing self-citations. The approach is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so the ledger is necessarily incomplete; no explicit free parameters, axioms, or invented entities are stated.

Reference graph

Works this paper leans on

43 extracted references · 5 canonical work pages

[1] Afchar, D., Nozick, V., Yamagishi, J., Echizen, I.: Mesonet: A compact facial video forgery detection network. In: IEEE International Workshop on Information Forensics and Security. pp. 1–7 (2018)

[2] Bai, S., Kolter, J.Z., Koltun, V.: An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271 (2018)

[3] Bertasius, G., Wang, H., Torresani, L.: Is space-time attention all you need for video understanding? In: International Conference on Machine Learning. pp. 813–824 (2021)

[4] Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., Jampani, V., Rombach, R.: Stable video diffusion: Scaling latent video diffusion models to large datasets. In: arXiv preprint arXiv:2311.15127 (2023)

[5] Chang, C., Liu, Z., Lyu, X., Qi, X.: What matters in detecting ai-generated videos like Sora? arXiv preprint arXiv:2406.19568 (2024)

[6] Chen, Y., Li, J., Zhang, X., Liu, H., Wang, W., Li, W.: Demamba: Ai-generated video detection on million-scale genvideo benchmark. In: arXiv preprint arXiv:2405.19707 (2024)

[7] Chinchor, N., Sundheim, B.M.: MUC-5 evaluation metrics. In: Fifth Message Understanding Conference (1993)

[8] Chu, B., Xu, X., Wang, X., Zhang, Y., You, W., Zhou, L.: FIRE: Robust detection of diffusion-generated images via frequency-guided reconstruction error. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12830–12839 (2025)

[9] Dai, X., Yu, Z., Hiang, C., Gao, C., He, Q., Wu, D., Xu, Z.: Detecting novel malware classes with a foundational multi-modality data analysis model. Data Intelligence 6(4), 968–993 (2024). https://doi.org/10.3724/2096-7004.di.2024.0056

[10] Dolhansky, B., Bitton, J., Pflaum, B., Lu, J., Howes, R., Wang, M., Ferrer, C.C.: The deepfake detection challenge dataset. In: arXiv preprint arXiv:2006.07397 (2020)

[11] Gao, X., Chen, W., Cui, Y., Dai, X., Dai, L.: Progressive adversarial contrastive learning: Towards efficient data augmentation in adversarial defense. Data Intelligence 7(4), 1169–1191 (2025). https://doi.org/10.3724/2096-7004.di.2025.0190

[12] Gu, A., Dao, T.: Mamba: Linear-time sequence modeling with selective state spaces. In: International Conference on Learning Representations (2024)

[13] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 770–778 (2016)

[14] Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D.P., Poole, B., Norouzi, M., Fleet, D.J., Salimans, T.: Imagen video: High definition video generation with diffusion models. In: arXiv preprint arXiv:2210.02303 (2022)

[15] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)

[16] Huang, J., Ling, C.X.: Using AUC and accuracy in evaluating learning algorithms. IEEE Transactions on Knowledge and Data Engineering 17(3), 299–310 (2005)

[17] Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., et al.: The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017)

[18] Li, L., Bao, J., Zhang, T., Yang, H., Chen, D., Wen, F., Guo, B.: Face x-ray for more general face forgery detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5001–5010 (2020)

[19] Li, Y., Yang, X., Sun, P., Qi, H., Lyu, S.: Celeb-df: A large-scale challenging dataset for deepfake forensics. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3207–3216 (2020)

[20] Li, Z., Lin, B., Ye, Y., Chen, L., Cheng, X., Yuan, S., Yuan, L.: WF-VAE: Enhancing video VAE by wavelet-driven energy flow for latent video diffusion model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 17778–17788 (2025). https://doi.org/10.1109/CVPR52734.2025.01656

[21] Liu, Z., Ning, J., Cao, Y., Wei, Y., Zhang, Z., Lin, S., Hu, H.: Video swin transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3202–3211 (2022)

[22] Luo, Y., Du, J., Yan, K., Ding, S.: LaRE²: Latent reconstruction error based method for diffusion-generated image detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)

[23] Ni, B., Peng, H., Chen, M., Zhang, S., Meng, G., Fu, J., Xiang, S., Ling, H.: Expanding language-image pretrained models for general video recognition. In: European Conference on Computer Vision. pp. 1–18 (2022)

[24] Ni, Z., Yan, Q., Huang, M., Yuan, T., Tang, Y., Hu, H., Chen, X., Wang, Y.: GenVidBench: A challenging benchmark for detecting ai-generated video. In: Proceedings of the AAAI Conference on Artificial Intelligence (2026)

[25] Ojha, U., Li, Y., Lee, Y.J.: Towards universal fake image detectors that generalize across generative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 24480–24489 (2023)

[26] OpenAI: Video generation models as world simulators. OpenAI Technical Report (2024), https://openai.com/research/video-generation-models-as-world-simulators

[28] Pang, Y., Zhang, Y., Wang, T.: VGMShield: Mitigating misuse of video generative models. arXiv preprint arXiv:2402.13126 (2024)

[29] Qin, Y., Xie, H., Li, Y., Tan, B., Ding, S.: Enhancing intermodal interaction for unified vision-language understanding and generation. Data Intelligence 7(2), 358–380 (2025). https://doi.org/10.3724/2096-7004.di.2025.0034

[30] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning. pp. 8748–8763 (2021)

[31] Ricker, J., Lukovnikov, D., Fischer, A.: Aeroblade: Training-free detection of latent diffusion images using autoencoder reconstruction error. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9130–9140 (2024)

[32] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10684–10695 (2022)

[33] Rossler, A., Cozzolino, D., Verdoliva, L., Riess, C., Thies, J., Nießner, M.: Faceforensics++: Learning to detect manipulated facial images. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 1–11 (2019)

[34] Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S., Hu, Q., Yang, H., Ashual, O., Gafni, O., Parikh, D., Gupta, S., Taigman, Y.: Make-a-video: Text-to-video generation without text-video data. In: International Conference on Learning Representations (2023)

[35] Song, X., Guo, X., Zhang, J., Li, Q., Bai, L., Liu, X., Zhai, G., Liu, X.: On learning multi-modal forgery representation for diffusion generated video detection. In: Advances in Neural Information Processing Systems (2024)

[36] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems. pp. 5998–6008 (2017)

[37] Villegas, R., Babaeizadeh, M., Kindermans, P.J., Moraldo, H., Zhang, H., Saffar, M., Castro, S., Kunze, J., Erhan, D.: Phenaki: Variable length video generation from open domain textual description. In: International Conference on Learning Representations (2023)

[38] Wang, Z., Bao, J., Zhou, W., Wang, W., Hu, H., Chen, H., Li, H.: Dire for diffusion-generated image detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 22445–22455 (2023)

[39] Wen, H., He, Y., Huang, Z., Li, T., Yu, Z., Huang, X., Qi, L., Wu, B., Li, X., Cheng, G.: Busterx: Mllm-powered ai-generated video forgery detection and explanation. arXiv preprint arXiv:2505.12620 (2025)

[40] Xu, J., Mei, T., Yao, T., Rui, Y.: MSR-VTT: A large video description dataset for bridging video and language. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5288–5296 (2016)

[41] Zhang, S., Lian, Z., Yang, J., Li, D., Pang, G., Liu, F., Han, B., Li, S., Tan, M.: Physics-driven spatiotemporal modeling for ai-generated video detection. In: Advances in Neural Information Processing Systems (2025)

[42] Zheng, C., Suo, R., Lin, C., Zhao, Z., Yang, L., Liu, S., Yang, M., Wang, C., Shen, C.: D3: Training-free ai-generated video detection using second-order features. In: arXiv preprint arXiv:2508.00701 (2025)

[43] Zhipu AI: Zhipu qingyan. https://chatglm.cn/ (2026), generative AI assistant. Accessed: 2026-05-28

[44] Zhu, Y., Li, Y., Wang, J., Gao, M., Wei, J.: FaKnow: A unified library for fake news detection. Data Intelligence 7(2), 461–484 (2025). https://doi.org/10.3724/2096-7004.di.2024.0026

[1] [1]

In: IEEE International Workshop on Information Foren- sics and Security

Afchar, D., Nozick, V., Yamagishi, J., Echizen, I.: Mesonet: A compact facial video forgery detection network. In: IEEE International Workshop on Information Foren- sics and Security. pp. 1–7 (2018)

2018

[2] [2]

arXiv preprint arXiv:1803.01271 (2018)

Bai, S., Kolter, J.Z., Koltun, V.: An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271 (2018)

Pith/arXiv arXiv 2018

[3] [3]

Bertasius, G., Wang, H., Torresani, L.: Is space-time attention all you need for video understanding? In: International Conference on Machine Learning. pp. 813– 824 (2021)

2021

[4] [4]

In: arXiv preprint arXiv:2311.15127 (2023)

Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., Jampani, V., Rombach, R.: Stable video diffusion: Scaling latent video diffusion models to large datasets. In: arXiv preprint arXiv:2311.15127 (2023)

Pith/arXiv arXiv 2023

[5] [5]

Chang, C., Liu, Z., Lyu, X., Qi, X.: What matters in detecting ai-generated videos like Sora? arXiv preprint arXiv:2406.19568 (2024)

arXiv 2024

[6] [6]

In: arXiv preprint arXiv:2405.19707 (2024)

Chen, Y., Li, J., Zhang, X., Liu, H., Wang, W., Li, W.: Demamba: Ai- generated video detection on million-scale genvideo benchmark. In: arXiv preprint arXiv:2405.19707 (2024)

arXiv 2024

[7] [7]

In: Fifth Message Un- derstanding Conference (1993)

Chinchor, N., Sundheim, B.M.: MUC-5 evaluation metrics. In: Fifth Message Un- derstanding Conference (1993)

1993

[8] [8]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Chu, B., Xu, X., Wang, X., Zhang, Y., You, W., Zhou, L.: FIRE: Robust de- tection of diffusion-generated images via frequency-guided reconstruction error. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12830–12839 (2025)

2025

[9] [9]

Data Intelligence 6(4), 968–993 (2024)

Dai, X., Yu, Z., Hiang, C., Gao, C., He, Q., Wu, D., Xu, Z.: Detecting novel malware classes with a foundational multi-modality data analysis model. Data Intelligence 6(4), 968–993 (2024). https://doi.org/10.3724/2096-7004.di.2024.0056

work page doi:10.3724/2096-7004.di.2024.0056 2024

[10] [10]

In: arXiv preprint arXiv:2006.07397 (2020) ReConFuse for AI-Generated Video Detection 13

Dolhansky, B., Bitton, J., Pflaum, B., Lu, J., Howes, R., Wang, M., Ferrer, C.C.: The deepfake detection challenge dataset. In: arXiv preprint arXiv:2006.07397 (2020) ReConFuse for AI-Generated Video Detection 13

Pith/arXiv arXiv 2006

[11] [11]

Data Intelli- gence7(4), 1169–1191 (2025)

Gao, X., Chen, W., Cui, Y., Dai, X., Dai, L.: Progressive adversarial contrastive learning: Towards efficient data augmentation in adversarial defense. Data Intelli- gence7(4), 1169–1191 (2025). https://doi.org/10.3724/2096-7004.di.2025.0190

work page doi:10.3724/2096-7004.di.2025.0190 2025

[12] [12]

In: International Conference on Learning Representations (2024)

Gu, A., Dao, T.: Mamba: Linear-time sequence modeling with selective state spaces. In: International Conference on Learning Representations (2024)

2024

[13] [13]

In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 770–778 (2016)

2016

[14] [14]

In: arXiv preprint arXiv:2210.02303 (2022)

Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D.P., Poole, B., Norouzi, M., Fleet, D.J., Salimans, T.: Imagen video: High definition video generation with diffusion models. In: arXiv preprint arXiv:2210.02303 (2022)

Pith/arXiv arXiv 2022

[15] [15]

Neural Computation 9(8), 1735–1780 (1997)

Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)

1997

[16] [16]

IEEE Transactions on Knowledge and Data Engineering17(3), 299–310 (2005)

Huang, J., Ling, C.X.: Using AUC and accuracy in evaluating learning algorithms. IEEE Transactions on Knowledge and Data Engineering17(3), 299–310 (2005)

2005

[17] [17]

arXiv preprint arXiv:1705.06950 (2017)

Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., et al.: The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017)

Pith/arXiv arXiv 2017

[18] [18]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Li, L., Bao, J., Zhang, T., Yang, H., Chen, D., Wen, F., Guo, B.: Face x-ray for more general face forgery detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5001–5010 (2020)

2020

[19] [19]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Li, Y., Yang, X., Sun, P., Qi, H., Lyu, S.: Celeb-df: A large-scale challenging dataset for deepfake forensics. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3207–3216 (2020)

2020

[20] [20]

In: CVPR

Li, Z., Lin, B., Ye, Y., Chen, L., Cheng, X., Yuan, S., Yuan, L.: WF-VAE: Enhanc- ing video VAE by wavelet-driven energy flow for latent video diffusion model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog- nition. pp. 17778–17788 (2025). https://doi.org/10.1109/CVPR52734.2025.01656

work page doi:10.1109/cvpr52734.2025.01656 2025

[21] [21]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Liu, Z., Ning, J., Cao, Y., Wei, Y., Zhang, Z., Lin, S., Hu, H.: Video swin trans- former. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3202–3211 (2022)

2022

[22] [22]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)

Luo, Y., Du, J., Yan, K., Ding, S.: LaRE 2: Latent reconstruction error based method for diffusion-generated image detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)

2024

[23] [23]

In: European Conference on Computer Vision

Ni, B., Peng, H., Chen, M., Zhang, S., Meng, G., Fu, J., Xiang, S., Ling, H.: Expanding language-image pretrained models for general video recognition. In: European Conference on Computer Vision. pp. 1–18 (2022)

2022

[24] [24]

In: Pro- ceedings of the AAAI Conference on Artificial Intelligence (2026)

Ni, Z., Yan, Q., Huang, M., Yuan, T., Tang, Y., Hu, H., Chen, X., Wang, Y.: GenVidBench: A challenging benchmark for detecting ai-generated video. In: Pro- ceedings of the AAAI Conference on Artificial Intelligence (2026)

2026

[25] [25]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Ojha, U., Li, Y., Lee, Y.J.: Towards universal fake image detectors that gener- alize across generative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 24480–24489 (2023)

2023

[26] [26]

OpenAI: Video generation models as world simulators. OpenAI Technical Report (2024), https://openai.com/research/video-generation-models-as-world-simulators

[27] [28]

Pang, Y., Zhang, Y., Wang, T.: VGMShield: Mitigating misuse of video generative models. arXiv preprint arXiv:2402.13126 (2024)

[28] [29]

Qin, Y., Xie, H., Li, Y., Tan, B., Ding, S.: Enhancing intermodal interaction for unified vision-language understanding and generation. Data Intelligence 7(2), 358–380 (2025). https://doi.org/10.3724/2096-7004.di.2025.0034

[29] [30]

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning. pp. 8748–8763 (2021)

[30] [31]

Ricker, J., Lukovnikov, D., Fischer, A.: Aeroblade: Training-free detection of latent diffusion images using autoencoder reconstruction error. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9130–9140 (2024)

[31] [32]

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10684–10695 (2022)

[32] [33]

Rossler, A., Cozzolino, D., Verdoliva, L., Riess, C., Thies, J., Nießner, M.: Faceforensics++: Learning to detect manipulated facial images. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 1–11 (2019)

[33] [34]

Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S., Hu, Q., Yang, H., Ashual, O., Gafni, O., Parikh, D., Gupta, S., Taigman, Y.: Make-a-video: Text-to-video generation without text-video data. In: International Conference on Learning Representations (2023)

[34] [35]

Song, X., Guo, X., Zhang, J., Li, Q., Bai, L., Liu, X., Zhai, G., Liu, X.: On learning multi-modal forgery representation for diffusion generated video detection. In: Advances in Neural Information Processing Systems (2024)

[35] [36]

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems. pp. 5998–6008 (2017)

[36] [37]

Villegas, R., Babaeizadeh, M., Kindermans, P.J., Moraldo, H., Zhang, H., Saffar, M., Castro, S., Kunze, J., Erhan, D.: Phenaki: Variable length video generation from open domain textual description. In: International Conference on Learning Representations (2023)

[37] [38]

Wang, Z., Bao, J., Zhou, W., Wang, W., Hu, H., Chen, H., Li, H.: Dire for diffusion-generated image detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 22445–22455 (2023)

[38] [39]

Wen, H., He, Y., Huang, Z., Li, T., Yu, Z., Huang, X., Qi, L., Wu, B., Li, X., Cheng, G.: Busterx: Mllm-powered ai-generated video forgery detection and explanation. arXiv preprint arXiv:2505.12620 (2025)

[39] [40]

Xu, J., Mei, T., Yao, T., Rui, Y.: MSR-VTT: A large video description dataset for bridging video and language. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5288–5296 (2016)

[40] [41]

Zhang, S., Lian, Z., Yang, J., Li, D., Pang, G., Liu, F., Han, B., Li, S., Tan, M.: Physics-driven spatiotemporal modeling for ai-generated video detection. In: Advances in Neural Information Processing Systems (2025)

[41] [42]

Zheng, C., Suo, R., Lin, C., Zhao, Z., Yang, L., Liu, S., Yang, M., Wang, C., Shen, C.: D3: Training-free ai-generated video detection using second-order features. arXiv preprint arXiv:2508.00701 (2025)

[42] [43]

Zhipu AI: Zhipu qingyan. https://chatglm.cn/ (2026), generative AI assistant. Accessed: 2026-05-28

[43] [44]

Zhu, Y., Li, Y., Wang, J., Gao, M., Wei, J.: FaKnow: A unified library for fake news detection. Data Intelligence 7(2), 461–484 (2025). https://doi.org/10.3724/2096-7004.di.2024.0026