arxiv: 2605.23113 · v1 · pith:ZUZ7MNINnew · submitted 2026-05-22 · 💻 cs.CV

Inconsistency-aware Multimodal Schrödinger Bridge for Deepfake Localization

Pith reviewed 2026-05-25 05:17 UTC · model grok-4.3

classification 💻 cs.CV
keywords deepfake localization · multimodal fusion · Schrödinger Bridge · cross-modal consistency · audio-visual detection · interval localization · asymmetric fusion · inconsistency awareness

The pith

Inconsistency-aware multimodal Schrödinger Bridge improves deepfake localization by asymmetrically allocating fusion steps based on cross-modal consistency estimates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that symmetric cross-modal fusion in audio-visual deepfake detection spreads noise from forged to authentic modalities under single-sided or asynchronous attacks, which harms precise interval localization. It claims that a Schrödinger Bridge framework solves this by first running a lightweight coarse bridge to propose candidate intervals and compute consistency scores, then using those scores to select witness signals and allocate refinement steps asymmetrically across modalities. This unification of consistency estimation, information selection, and step scheduling matters because it produces time-aligned intervals without explicit noise injection or extra iterations. If the claim holds, high-precision boundary metrics improve especially on single-sided forgeries.

Core claim

IaMSB unifies consistency estimation, cross-modal information selection, and bridge-step scheduling within the Schrödinger Bridge. A lightweight coarse bridge proposes candidate intervals and estimates cross-modal consistency; these statistics select witness signals and allocate bridge steps asymmetrically. A refinement bridge then performs step-tuned fusion and outputs refined, time-aligned intervals. The approach anticipates single-sided and asynchronous forgeries, suppresses noise transfer via bottlenecked interaction, and avoids unnecessary iterations.
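
The abstract does not state the rule that turns consistency statistics into a step schedule. The sketch below is one hypothetical way to do it, assuming per-modality consistency scores in [0, 1] and a fixed total budget. The function name `allocate_steps`, the `min_steps` floor, and the choice to give more steps to the less consistent modality are all illustrative assumptions, not the paper's method.

```python
def allocate_steps(consistency, total_steps, min_steps=1):
    """Split a fixed bridge-step budget across modalities.

    Hypothetical rule (not from the paper): each modality is weighted by
    its inconsistency, 1 - score, so the suspect modality gets more steps.
    """
    names = list(consistency)
    if total_steps < min_steps * len(names):
        raise ValueError("budget too small for the per-modality minimum")

    # Every modality gets the floor; the remainder is shared by inconsistency.
    spare = total_steps - min_steps * len(names)
    weights = [1.0 - consistency[n] for n in names]
    total_w = sum(weights)
    if total_w == 0:
        # All modalities look fully consistent: fall back to an even split.
        weights = [1.0] * len(names)
        total_w = float(len(names))

    shares = [spare * w / total_w for w in weights]
    steps = [min_steps + int(s) for s in shares]

    # Hand out steps lost to truncation, largest fractional part first.
    leftover = total_steps - sum(steps)
    order = sorted(range(len(names)),
                   key=lambda i: shares[i] - int(shares[i]),
                   reverse=True)
    for i in order[:leftover]:
        steps[i] += 1
    return dict(zip(names, steps))


schedule = allocate_steps({"audio": 0.9, "video": 0.3}, total_steps=10)
print(schedule)
```

With equal scores the split is even, which recovers symmetric fusion as a special case of this toy rule.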

What carries the argument

The inconsistency-aware multimodal Schrödinger Bridge, which minimizes path-distribution discrepancy to produce consistency scores and enables asymmetric step allocation across modalities.
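
For orientation, "path-distribution discrepancy" is usually formalized as the classical Schrödinger Bridge problem. The statement below is the textbook form, not the paper's own objective. The symbols are assumptions for illustration: $Q$ is a reference path measure on $[0, T]$, and $\mu$, $\nu$ are the prescribed endpoint marginals. The abstract does not say which distributions IaMSB places at the two endpoints.

```latex
\min_{P \in \mathcal{P}(\mu,\nu)} \; \mathrm{KL}\left( P \,\|\, Q \right),
\qquad
\mathcal{P}(\mu,\nu) = \left\{ P \;:\; P_0 = \mu,\; P_T = \nu \right\}
```

The minimizer is the path measure closest to the reference dynamics among all those that match both endpoints.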

If this is right

  • Stabilizes strict-IoU boundary precision across benchmarks
  • Raises AP@0.95 by 3% to 10%
  • Yields improved high-precision localization particularly for single-sided forgeries
  • Suppresses cross-modal noise transfer without extra iterations
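
To see why AP@0.95 is a strict boundary metric, recall that a predicted interval counts as correct at that threshold only if its temporal IoU with the ground truth is at least 0.95. The minimal sketch below uses made-up intervals in seconds; the function name `temporal_iou` is illustrative.

```python
def temporal_iou(pred, truth):
    """IoU of two time intervals given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0 else 0.0


truth = (2.00, 3.00)   # a one-second forged segment (made-up example)
near = (2.00, 3.04)    # end boundary 40 ms late
loose = (2.00, 3.10)   # end boundary 100 ms late

print(temporal_iou(near, truth) >= 0.95)   # True  (IoU is about 0.96)
print(temporal_iou(loose, truth) >= 0.95)  # False (IoU is about 0.91)
```

For a one-second segment, a boundary error of a tenth of a second is already enough to miss the threshold, which is why noise leaking in from the other modality can matter so much at this operating point.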

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same coarse-to-refinement structure with asymmetric allocation could be tested on video-text or other modality pairs where corruption is one-sided.
  • Path-distribution minimization in Schrödinger Bridges may offer advantages over diffusion models whenever explicit noise modeling is undesirable for consistency estimation.
  • The method implies that step scheduling derived from consistency statistics can generalize to other detection tasks that require temporal evidence under partial forgery.

Load-bearing premise

A lightweight coarse bridge can reliably propose candidate intervals and estimate cross-modal consistency so that these statistics can allocate bridge steps asymmetrically without introducing new errors.

What would settle it

A controlled test on single-sided forgery benchmarks in which the coarse bridge's interval proposals and consistency estimates produce AP@0.95 no higher than symmetric fusion baselines.

Figures

Figures reproduced from arXiv: 2605.23113 by Jiayu Xiong, Jing Wang, Jun Xue, Qi Zhang, Wanlong Wang.

Figure 1: A bridge estimates the cross-modal deepfake event set's …
Figure 2: Overview of IaMSB. The extractors E_a and E_v produce modality-specific token sequences. The Coarse/Witness/Refinement Schrödinger Bridges (CSB/WSB/RSB) are described in Secs. 3.3.2, 3.3.3, and 3.3.4. Heads μ_c^m and μ_r^m follow Eq. (3); MHA(·) follows Eq. (2), with Q/K/V roles indicated. The coarse stage yields N_ev latent events; after WSB screening they reduce to N̂_a and N̂_v.
Figure 3: Compute scaling and directional scales. (a) Step-accuracy trade-off of IaMSB under a fixed budget (sensitive of …
Abstract

Audio-visual deepfake localization demands interval-level outputs that serve as temporal evidence. Despite recent progress, symmetric fusion under single-sided or asynchronous forgeries propagates cross-modal noise, degrading high-precision localization. We present IaMSB, an inconsistency-aware multimodal Schrödinger Bridge (SB) that jointly estimates cross-modal consistency and performs interval-level localization. Unlike diffusion models, SB minimizes path-distribution discrepancy and yields consistency scores without explicit noise injection or denoising. With SB, IaMSB unifies consistency estimation, cross-modal information selection, and bridge-step scheduling in one framework. Specifically, a lightweight coarse bridge first proposes candidate intervals and estimates cross-modal consistency; these statistics select cross-modal witness signals and allocate bridge steps asymmetrically across modalities. A refinement bridge then performs step-tuned fusion and outputs refined, time-aligned intervals. IaMSB anticipates single-sided and asynchronous forgeries and, using bottlenecked cross-modal interaction with step allocation, suppresses noise transfer and avoids unnecessary iterations. Across benchmarks, IaMSB stabilizes strict-IoU boundary precision, raising AP@0.95 by 3% to 10%, and yields improved high-precision localization, particularly for single-sided forgeries.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces IaMSB, an inconsistency-aware multimodal Schrödinger Bridge for audio-visual deepfake localization. It claims that a lightweight coarse bridge proposes candidate intervals and estimates cross-modal consistency, which then enables witness-signal selection and asymmetric bridge-step allocation in a refinement bridge; this suppresses cross-modal noise transfer under single-sided or asynchronous forgeries. The method is said to stabilize strict-IoU boundary precision and raise AP@0.95 by 3–10 % across benchmarks, with particular gains on single-sided forgeries.

Significance. If the empirical gains and the coarse-to-refinement pipeline hold under rigorous validation, the work would supply a principled SB-based alternative to symmetric fusion that directly addresses noise propagation in multimodal localization, a practically relevant advance for high-precision deepfake evidence.

major comments (1)
  1. [Abstract] Abstract (third paragraph): the reported 3–10 % AP@0.95 improvement and the noise-suppression claim rest on the coarse bridge producing sufficiently accurate interval proposals and consistency scores; no independent validation metric, ablation, or training protocol for this stage is described, leaving the asymmetric scheduler’s correctness unverified and the central pipeline’s soundness dependent on an untested prerequisite.
minor comments (1)
  1. [Abstract] The abstract uses LaTeX markup (Schrödinger) that should be rendered consistently in the final manuscript.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting this important point about the coarse bridge stage. We address the concern directly below and commit to strengthening the manuscript accordingly.

  1. Referee: [Abstract] Abstract (third paragraph): the reported 3–10 % AP@0.95 improvement and the noise-suppression claim rest on the coarse bridge producing sufficiently accurate interval proposals and consistency scores; no independent validation metric, ablation, or training protocol for this stage is described, leaving the asymmetric scheduler’s correctness unverified and the central pipeline’s soundness dependent on an untested prerequisite.

    Authors: We agree that the abstract (and, upon re-examination, the main text) does not provide separate validation metrics, ablations, or an explicit training protocol for the coarse bridge in isolation. The current description presents the coarse bridge as a lightweight first stage trained jointly within the overall IaMSB objective, with its outputs directly feeding the refinement stage. However, this leaves the prerequisite accuracy of interval proposals and consistency scores unverified independently. We will revise the manuscript to include: (1) a dedicated subsection detailing the coarse bridge architecture, loss, and training protocol; (2) an ablation that reports standalone metrics for the coarse stage (e.g., proposal recall at various IoU thresholds and consistency score correlation with ground-truth forgery labels); and (3) an analysis of how coarse-stage errors propagate to the final AP@0.95. These additions will be placed in Section 3 and the experiments section. revision: yes
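
The standalone coarse-stage metric promised in item (2) could look roughly like the sketch below. It is a hypothetical illustration, not the authors' evaluation code: the function names, the thresholds, and the toy intervals are all assumptions.

```python
def temporal_iou(a, b):
    """IoU of two time intervals given as (start, end) in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def proposal_recall(proposals, truths, thresholds=(0.5, 0.75, 0.95)):
    """Fraction of ground-truth intervals covered by at least one proposal.

    A ground-truth interval counts as covered at threshold t when some
    proposal reaches temporal IoU >= t with it.
    """
    recall = {}
    for t in thresholds:
        hits = sum(
            any(temporal_iou(p, g) >= t for p in proposals) for g in truths
        )
        recall[t] = hits / len(truths) if truths else 0.0
    return recall


# Toy data: the first proposal is tight, the second overshoots by 0.5 s.
truths = [(1.0, 2.0), (5.0, 6.0)]
proposals = [(1.0, 2.02), (5.0, 6.5)]
print(proposal_recall(proposals, truths))
# {0.5: 1.0, 0.75: 0.5, 0.95: 0.5}
```

Reporting this curve for the coarse bridge alone would show whether the refinement stage starts from proposals that are already close, or has to recover from coarse-stage misses.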

Circularity Check

0 steps flagged

No circularity: derivation chain is self-contained with independent SB formulation and empirical gains


The paper presents IaMSB as a multimodal Schrödinger Bridge framework that unifies consistency estimation, witness selection, and asymmetric step allocation via a coarse-to-refinement pipeline. No quoted equations or steps reduce outputs to inputs by construction, rename fitted parameters as predictions, or rely on self-citations for load-bearing uniqueness claims. The coarse bridge's interval proposal and consistency scoring are described as operational components whose reliability is presupposed for the pipeline but not derived from the target localization metrics; reported AP@0.95 gains are positioned as empirical outcomes rather than tautological. The derivation therefore remains externally falsifiable against benchmarks without reducing to self-definition or fitted-input renaming.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Only the abstract is available, so the ledger records only elements explicitly named; no free parameters, additional axioms, or invented entities beyond the model itself are described.

axioms (1)
  • [standard math] Schrödinger Bridge minimizes path-distribution discrepancy and yields consistency scores without explicit noise injection or denoising.
    Stated in the abstract as the property that distinguishes SB from diffusion models and enables the consistency estimation.
invented entities (1)
  • IaMSB (inconsistency-aware multimodal Schrödinger Bridge) [no independent evidence]
    purpose: Jointly estimates cross-modal consistency and performs interval-level deepfake localization via coarse and refinement bridges.
    Newly introduced model name and architecture described in the abstract.

