Mining Forgery Traces from Reconstruction Error: A Weakly Supervised Framework for Multimodal Deepfake Temporal Localization

Midou Guo; Qilin Yin; Rui Yang; Wei Lu

arxiv: 2601.21458 · v2 · pith:KJG7JJTLnew · submitted 2026-01-29 · 💻 cs.CV


Pith reviewed 2026-05-21 15:24 UTC · model grok-4.3

classification 💻 cs.CV

keywords deepfake temporal localization · weakly supervised detection · reconstruction error · masked autoencoder · contrastive loss · forgery localization · multimodal deepfake


The pith

A masked autoencoder trained only on authentic videos produces higher reconstruction errors on forged segments, enabling accurate temporal localization of deepfakes using only video-level labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to temporally localize deepfake manipulations that are intermittent and confined to short segments of a video, without the expense of frame-by-frame human annotations. It trains a masked autoencoder exclusively on real videos so the model learns normal spatiotemporal patterns, then flags forged intervals through the resulting reconstruction discrepancies. A new asymmetric intra-video contrastive loss uses those error signals to make authentic features more compact and to create a usable decision boundary. If the approach holds, it would let systems scan large video collections for tampering with far less labeling effort while remaining effective against new generative methods.

Core claim

The framework identifies forgeries by measuring reconstruction errors from a masked autoencoder trained exclusively on authentic data, which highlights forged segments through discrepancies in spatiotemporal patterns. The asymmetric intra-video contrastive loss then uses these error cues to enforce compactness in authentic features, creating a decision boundary that supports accurate localization even for unseen forgery methods.
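To make the localization step concrete, here is a minimal sketch of how per-frame reconstruction errors might be turned into candidate forged intervals. It is not the paper's implementation: the function name `localize_from_errors`, the within-video z-score rule, and the `z_thresh` and `min_len` parameters are all assumptions made for illustration, and the errors are taken as given rather than produced by an actual masked autoencoder.

```python
from statistics import mean, pstdev


def localize_from_errors(errors, z_thresh=1.5, min_len=2):
    """Turn per-frame reconstruction errors into (start, end) frame intervals.

    Hypothetical rule: a frame is flagged when its error lies more than
    z_thresh standard deviations above the video's own mean error. Runs of
    flagged frames shorter than min_len are discarded as noise.
    """
    mu = mean(errors)
    sigma = pstdev(errors) or 1.0  # constant input: fall back to 1.0 so nothing is flagged
    flagged = [(e - mu) / sigma > z_thresh for e in errors]

    segments = []
    start = None
    for i, hit in enumerate(flagged):
        if hit and start is None:
            start = i  # a flagged run begins
        elif not hit and start is not None:
            if i - start >= min_len:
                segments.append((start, i - 1))  # close the run, inclusive end
            start = None
    # Close a run that extends to the last frame.
    if start is not None and len(flagged) - start >= min_len:
        segments.append((start, len(flagged) - 1))
    return segments
```

On a toy sequence with three high-error frames in the middle of otherwise flat errors, the sketch returns a single interval covering those frames. A per-video threshold is only one option; a threshold calibrated on held-out authentic videos may be less prone to flagging outliers in fully authentic clips.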

What carries the argument

A masked autoencoder trained solely on authentic videos whose reconstruction errors act as forgery indicators, paired with an asymmetric intra-video contrastive loss that converts those errors into a stable separation between real and manipulated segments.
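The abstract does not say how the asymmetry in AICL is realized, so the following is a hypothetical sketch and not the authors' loss. It assumes that the K lowest-error frames in a video are treated as authentic anchors and pulled toward their centroid, while the K highest-error frames are only pushed beyond a margin from that centroid and are never asked to cluster among themselves. The names `asymmetric_contrastive_sketch`, `k`, and `margin` are illustrative.

```python
import math


def asymmetric_contrastive_sketch(features, errors, k=2, margin=1.0):
    """Hypothetical asymmetric loss guided by reconstruction errors.

    features: list of per-frame feature vectors (lists of floats)
    errors:   list of per-frame reconstruction errors, same length
    """
    if 2 * k > len(errors):
        raise ValueError("need at least 2 * k frames")

    # Rank frames by error: lowest-error frames act as authentic anchors,
    # highest-error frames act as suspected forgeries.
    order = sorted(range(len(errors)), key=lambda i: errors[i])
    low, high = order[:k], order[-k:]

    dim = len(features[0])
    center = [sum(features[i][d] for i in low) / k for d in range(dim)]

    def dist(vec):
        return math.sqrt(sum((x - c) ** 2 for x, c in zip(vec, center)))

    # Compactness term: applied to authentic anchors only (the asymmetry).
    pull = sum(dist(features[i]) ** 2 for i in low) / k
    # Separation term: hinge that is zero once a suspect frame is past the margin.
    push = sum(max(0.0, margin - dist(features[i])) ** 2 for i in high) / k
    return pull + push
```

Under these assumptions the loss is zero when the low-error frames coincide and every high-error frame sits at least `margin` away from their centroid, which is one way to read "compact authentic features with a stable boundary".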

If this is right

Reconstruction errors supply fine-grained temporal cues without requiring dense frame-level annotations.
The contrastive loss creates a stable boundary that improves local discrimination inside each video.
The method remains effective against forgeries generated by advanced models absent from training.
The abstract reports state-of-the-art weakly supervised localization on large-scale datasets, including LAV-DF.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same error-based cue could extend to spotting edits in standalone audio recordings or still images.
Combining reconstruction signals with other weak labels might further cut annotation costs in multimedia verification.
Evaluating the approach on videos that mix short authentic and forged clips would test its behavior in realistic mixed-content scenarios.

Load-bearing premise

Reconstruction errors from the autoencoder will be reliably larger and distinguishable for forged video segments compared to authentic ones, even when the forgeries use unseen advanced techniques.

What would settle it

An experiment in which a new generative forgery method produces reconstruction errors of similar magnitude and distribution to those of authentic segments would disprove the core detection mechanism.
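One simple way to run that check, offered as a sketch and not as the paper's evaluation protocol, is to compare frame-level errors from authentic and forged segments with a rank-based separation score (equivalent to ROC AUC). The function name `error_separation_auc` and any input values are assumptions; a score near 0.5 for a new generator would correspond to the failure case described above.

```python
def error_separation_auc(authentic_errors, forged_errors):
    """Probability that a random forged frame has a higher reconstruction
    error than a random authentic frame; ties count as one half.

    1.0 means the two error distributions are perfectly separated,
    0.5 means the errors carry no usable signal.
    """
    wins = 0.0
    for f in forged_errors:
        for a in authentic_errors:
            if f > a:
                wins += 1.0
            elif f == a:
                wins += 0.5
    return wins / (len(forged_errors) * len(authentic_errors))
```

The pairwise loop is quadratic in the number of frames, which is fine for a sanity check on a few videos; a sort-based rank computation would be the usual choice at dataset scale.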

Figures

Figures reproduced from arXiv: 2601.21458 by Midou Guo, Qilin Yin, Rui Yang, Wei Lu.

**Figure 1.** Comparison of different temporal forgery localization tasks: (a) Fully supervised temporal forgery localization; (b) Multimodal weakly supervised temporal forgery localization.

**Figure 2.** (a) The overall workflow and data flow of the proposed framework. (b) The internal architecture of the core components within RT-DeepLoc, which includes the Multimodal Feature Encoding and Fusion module, the Forgery Discovery Network based on MAE, the Asymmetric Intra-video Contrastive Loss module, and the Multi-task Learning Reinforcement strategy.

**Figure 3.** Sensitivity analysis of hyperparameters on the LAV-DF dataset. (a) The effect of the number of selected frames K in the AICL module. (b) The effect of the masking ratio ρ in the FDN module.

**Figure 4.** Qualitative visualization of modality-specific reconstruction discrepancies on LAV-DF. We present four scenarios: (a) audio-only, (b) multimodal, (c) visual-only forgeries, and (d) authentic video. Blue and green curves represent visual and audio reconstruction errors, respectively, while shaded areas indicate ground-truth intervals.

Original abstract

Modern deepfakes have evolved into localized and intermittent manipulations that require fine-grained temporal localization to mitigate severe digital security risks. The prohibitive cost of frame-level annotation makes weakly supervised methods a practical necessity, which rely only on video-level labels. To this end, we propose Reconstruction-based Temporal Deepfake Localization (RT-DeepLoc), a weakly supervised temporal forgery localization framework that identifies forgeries via reconstruction errors. Our framework uses a Masked Autoencoder (MAE) trained exclusively on authentic data to learn its intrinsic spatiotemporal patterns; this allows the model to produce significant reconstruction discrepancies for forged segments, effectively providing the missing fine-grained cues for accurate localization without demanding dense human annotations. To robustly leverage these indicators, we introduce a novel Asymmetric Intra-video Contrastive Loss (AICL). By focusing on the compactness of authentic features guided by these reconstruction cues, AICL establishes a stable decision boundary that enhances local discrimination while preserving generalization to unseen forgeries by advanced generative models. Extensive experiments on large-scale datasets, including LAV-DF, demonstrate that RT-DeepLoc achieves state-of-the-art performance in weakly-supervised temporal forgery localization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

RT-DeepLoc mines forgery traces via reconstruction error from an authentic-only MAE plus a new asymmetric contrastive loss, but the abstract supplies no numbers or ablations to check if the core cue actually works.


The main point is that this paper offers a weakly supervised way to localize intermittent deepfakes by training a Masked Autoencoder only on real videos so that forged segments produce larger reconstruction errors, then feeding those errors into a new Asymmetric Intra-video Contrastive Loss to sharpen the decision boundary under video-level labels only. The concrete framework that pairs MAE error mining with AICL for multimodal temporal localization is not a standard extension of prior deepfake work. It does address a genuine practical gap: dense frame annotations are too expensive for large archives, so any method that turns cheap video labels into usable temporal cues is worth examining. The reconstruction cue is a straightforward idea that could supply the missing fine-grained signal without extra human work, and the asymmetric loss is set up to keep real features tight while still allowing detection of anomalies from unseen generators. That combination is the clearest piece of novelty here.

The soft spot is the missing evidence. The abstract states SOTA results on LAV-DF and similar sets but shows no tables, no baseline numbers, no ablations, and no characterization of how big the error gap actually is between real and forged segments. Without that data it is hard to judge whether the central assumption holds up, especially when the forgeries come from advanced models the MAE never saw. The stress-test concern about overlapping error distributions is reasonable at the level of the abstract; if the gap is small or noisy the whole pipeline loses its main source of supervision.

This paper is for people working on scalable deepfake detection and video forensics who already know the weak-supervision literature. A reader who needs ideas for turning self-supervised reconstruction into anomaly cues might extract something usable. I would send it for peer review because the problem is real, the framing is direct, and a full version with proper experiments could be worth referee time even if the current draft needs more empirical grounding.

Referee Report

2 major / 1 minor

Summary. The paper proposes RT-DeepLoc, a weakly supervised framework for multimodal deepfake temporal localization. It trains a Masked Autoencoder exclusively on authentic videos to generate reconstruction errors that highlight forged segments, then uses these errors to supervise an Asymmetric Intra-video Contrastive Loss (AICL) for local discrimination. The authors claim this yields state-of-the-art performance on large-scale datasets including LAV-DF while generalizing to unseen generative models.

Significance. If the reconstruction-error cue proves reliable, the work would advance weakly-supervised temporal localization by supplying fine-grained supervision without frame-level annotations and by demonstrating generalization to advanced forgeries. The MAE-based cue and AICL constitute a novel combination that could influence subsequent annotation-efficient forgery detection pipelines.

major comments (2)

[Abstract] Abstract: the central SOTA performance claim is stated without any numerical results, tables, ablation studies, error bars, or baseline comparisons, so the primary empirical contribution cannot be assessed from the manuscript text.
[MAE training and error-based cue] MAE training and error-based cue (abstract paragraph): the load-bearing premise that reconstruction errors will be reliably and significantly larger on forged segments, even those produced by advanced generative models absent from training, is asserted but not supported by any quantitative characterization of error distributions, gap sizes, or an ablation that removes the reconstruction cue.

minor comments (1)

[Abstract] Abstract: the description of AICL could be expanded with a one-sentence statement of how the asymmetry is realized (e.g., weighting or margin terms).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, indicating where revisions will be made to improve clarity and substantiation of claims.


Referee: [Abstract] Abstract: the central SOTA performance claim is stated without any numerical results, tables, ablation studies, error bars, or baseline comparisons, so the primary empirical contribution cannot be assessed from the manuscript text.

Authors: We agree that the abstract would be strengthened by including specific numerical results to support the SOTA claim. In the revised manuscript, we will update the abstract to report key metrics such as the mAP achieved on LAV-DF under video-level supervision, along with direct comparisons to recent baselines and a brief note on ablation outcomes. revision: yes

Referee: [MAE training and error-based cue] MAE training and error-based cue (abstract paragraph): the load-bearing premise that reconstruction errors will be reliably and significantly larger on forged segments, even those produced by advanced generative models absent from training, is asserted but not supported by any quantitative characterization of error distributions, gap sizes, or an ablation that removes the reconstruction cue.

Authors: The full manuscript contains quantitative characterization of reconstruction errors, including distribution plots, gap measurements between authentic and forged segments, and results for unseen generative models in the experimental analysis. An ablation isolating the contribution of the reconstruction cue is also reported. To address the concern that this support is not evident from the abstract, we will revise the relevant abstract paragraph to include a concise reference to these empirical observations and error gap sizes. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation relies on independent training and empirical cue


The paper's core chain trains an MAE exclusively on authentic videos to produce reconstruction errors, then feeds those errors into a novel Asymmetric Intra-video Contrastive Loss for weakly-supervised localization. This is not self-definitional, does not rename a fitted input as a prediction, and contains no load-bearing self-citations or uniqueness theorems imported from prior author work. The localization output is not equivalent to any input parameter by construction; it depends on the external assumption that reconstruction discrepancies will be larger on unseen forgeries, which is tested via experiments on LAV-DF rather than being tautological. The framework remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that authentic-only MAE training produces discriminative reconstruction errors for unseen forgeries; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)

Domain assumption: Reconstruction error from an MAE trained exclusively on authentic videos reliably indicates locations of forged segments even for advanced unseen generative models.
This assumption supplies the fine-grained cue that replaces frame-level labels and is invoked when the abstract describes how the model produces discrepancies for forged segments.

pith-pipeline@v0.9.0 · 5737 in / 1377 out tokens · 137423 ms · 2026-05-21T15:24:08.217436+00:00 · methodology


Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages

[1] Afchar, D., Nozick, V., Yamagishi, J., and Echizen, I. MesoNet: a Compact Facial Video Forgery Detection Network. IEEE International Workshop on Information Forensics and Security, 2018.

[2] Baevski, A., Zhou, Y., Mohamed, A., and Auli, M. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. Advances in Neural Information Processing Systems, 2020.

[3] Cai, Z., Stefanov, K., Dhall, A., and Hayat, M. Do You Really Mean That? Content Driven Audio-Visual Deepfake Dataset and Multimodal Method for Temporal Forgery Localization. In International Conference on Digital Image Computing: Techniques and Applications, 2022.

[4] Hayat, M. Glitch in the matrix: A large scale benchmark for content driven audio–visual forgery detection and localization. In Computer Vision and Image Understanding, 2023.

[5] Gedeon, T., and Stefanov, K. AV-Deepfake1M: A Large-Scale LLM-Driven Audio-Visual Deepfake Dataset. In Proceedings of the 32nd ACM International Conference on Multimedia, 2024.

[6] Feng, Q., Li, W., Lin, T., and Chen, X. Full-Stage Pseudo Label Quality Enhancement for Weakly-Supervised Temporal Action Localization. IEEE Transactions on Circuits and Systems for Video Technology, 2025.

[7] Gao, J., Chen, M., and Xu, C. Fine-grained Temporal Contrastive Learning for Weakly-supervised Temporal Action Localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.

[8] Guo, M., Yin, Q., Lu, W., and Luo, X. Towards Open-world Generalized Deepfake Detection: General Feature Extraction via Unsupervised Domain Adaptation. In Proceedings of the 33rd ACM International Conference on Multimedia, 2025.

[9] He, K., Chen, X., Xie, S., Li, Y., Dollár, P., and Girshick, R. Masked Autoencoders Are Scalable Vision Learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.

[10] Galuba, W., Metze, F., and Feichtenhofer, C. Masked Autoencoders that Listen. Advances in Neural Information Processing Systems, 2022.

[11] Li, Y., Chang, M.-C., and Lyu, S. In Ictu Oculi: Exposing AI Created Fake Videos by Detecting Eye Blinking. IEEE International Workshop on Information Forensics and Security, 2018.

[12] Li, Z., Wang, Z., and Dong, C. Multilevel semantic and adaptive actionness learning for weakly supervised temporal action localization. Neural Networks, 2025.

[13] Liu, M., Wang, J., Qian, X., and Li, H. Audio-Visual Temporal Forgery Detection Using Embedding-Level Fusion and Multi-Dimensional Contrastive Loss. IEEE Transactions on Circuits and Systems for Video Technology, 2024.

[14] Zhang, S. DomainForensics: Exposing Face Forgery Across Domains via Bi-Directional Adaptation. IEEE Transactions on Information Forensics and Security, 2024.

[15] Sheng, Z., Qu, Z., Lu, W., Cao, X., and Huang, J. DiR-Loc: Disentanglement Representation Learning for Robust Image Forgery Localization. IEEE Transactions on Dependable and Secure Computing, 2024.

[16] Sheng, Z., Lu, W., Luo, X., Zhou, J., and Cao, X. SUMI-IFL: An Information-Theoretic Framework for Image Forgery Localization with Sufficiency and Minimality Constraints. In Proceedings of the AAAI Conference on Artificial Intelligence, 2025.

[17] Shi, D., Zhong, Y., Cao, Q., Ma, L., Li, J., and Tao, D. TriDet: Temporal Action Detection with Relative Boundary Modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.

[18] Tong, Z., Song, Y., Wang, J., and Wang, L. VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training. Advances in Neural Information Processing Systems, 2022.

[19] Wang, B., Zhao, Y., Yang, L., Long, T., and Li, X. Temporal Action Localization in the Deep Learning Era: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.

[20] Wang, G., Zhao, P., Zhao, C., Yang, S., Cheng, J., Leng, L., Liao, J., and Guo, Q. Weakly-Supervised Action Localization by Hierarchically-structured Latent Attention Modeling. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023.

[21] Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., and Van Gool, L. Temporal Segment Networks for Action Recognition in Videos. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

[22] Wu, J., Xu, W., Lu, W., Luo, X., Yang, R., and Guo, S. Weakly-supervised Audio Temporal Forgery Localization via Progressive Audio-language Co-learning Network. In Proceedings of the International Joint Conference on Artificial Intelligence, 2025.

[23] Yin, Q., Lu, W., Li, B., and Huang, J. Dynamic Difference Learning With Spatio–Temporal Correlation for Deepfake Video Detection. IEEE Transactions on Information Forensics and Security, 2023.

[24] Yu, P., Fei, J., Gao, H., Feng, X., Xia, Z., and Chang, C. H. Unlocking the Capabilities of Large Vision-Language Models for Generalizable and Explainable Deepfake Detection. In Forty-second International Conference on Machine Learning, 2025.

[25] Zhang, C., Cao, M., Yang, D., Chen, J., and Zou, Y. CoLA: Weakly-Supervised Temporal Action Localization with Snippet Contrastive Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.

[26] Zhang, C.-L., Wu, J., and Li, Y. ActionFormer: Localizing Moments of Actions with Transformers. In European Conference on Computer Vision, 2022.

[27] Acoustics, Speech and Signal Processing, 2025

[28] Zhang, R., Wang, H., Du, M., Liu, H., Zhou, Y., and Zeng, Q. UMMAFormer: A Universal Multimodal-adaptive Transformer Framework for Temporal Forgery Localization. In Proceedings of the 31st ACM International Conference on Multimedia, 2023.

[29] Li, Z., Hu, B., Feng, W., Gong, T., and Chu, Q. MFMS: Learning Modality-Fused and Modality-Specific Features for Deepfake Detection and Localization Tasks. In Proceedings of the 32nd ACM International Conference on Multimedia, 2024.

[30] Zhou, X., Han, H., Shan, S., and Chen, X. Fine-grained open-set deepfake detection via unsupervised domain adaptation. IEEE Transactions on Information Forensics and Security, 2024.
