pith. machine review for the scientific record.

arxiv: 2604.23957 · v1 · submitted 2026-04-27 · 💻 cs.CV

LAVA: Layered Audio-Visual Anti-tampering Watermarking for Robust Deepfake Detection and Localization

Pith reviewed 2026-05-08 04:55 UTC · model grok-4.3

classification 💻 cs.CV
keywords deepfake detection · watermarking · audio-visual fusion · tamper localization · robust watermarking · video compression · deepfake localization

The pith

LAVA fuses watermarks across audio and visual layers to maintain deepfake detection and localization under compression and desynchronization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes LAVA as a proactive watermarking system that embeds signals in both audio and visual tracks of short videos and then fuses them with calibration-aware alignment. This design aims to keep tamper evidence intact even when videos undergo codec compression or lose audio-visual timing sync, problems that break earlier decoupled or fragile watermark methods. A reader would care because deepfake videos now circulate at scale on social platforms where compression and editing are routine, so reliable preemptive detection could limit their spread. If the approach holds, it would let platforms or creators verify authenticity without waiting for post-hoc forensic analysis. The experiments claim this yields near-perfect detection scores while outperforming prior audio-visual baselines on localization tasks.

Core claim

LAVA is a calibration-aware audio-visual watermark fusion framework that leverages cross-modal watermark fusion and calibration-aware alignment to preserve consistent and reliable tamper evidence under compression and audio-visual asynchrony, enabling robust tamper detection and localization.

What carries the argument

The layered audio-visual anti-tampering watermarking framework, which embeds and fuses signals in audio and visual layers with calibration to survive degradations.
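
The abstract does not say how the fusion or the calibration is implemented. As a minimal sketch of what calibration-aware score fusion could look like, the Python below assumes each modality yields per-frame tamper logits and applies temperature scaling, a standard calibration technique swapped in here purely for illustration. The name `fuse_scores`, the temperatures `t_visual` and `t_audio`, and the weight `w_visual` are hypothetical and are not taken from the paper.

```python
import math


def sigmoid(x: float) -> float:
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def fuse_scores(visual_logits, audio_logits,
                t_visual=1.0, t_audio=1.0, w_visual=0.5):
    """Fuse per-frame tamper logits from two modalities (illustrative only).

    Each modality's logits are divided by its own temperature, a hypothetical
    calibration parameter that would be fitted on held-out data, and the
    rescaled logits are combined by a weighted average before the sigmoid.
    """
    if len(visual_logits) != len(audio_logits):
        raise ValueError("score sequences must have the same length")
    fused = []
    for v, a in zip(visual_logits, audio_logits):
        z = w_visual * (v / t_visual) + (1.0 - w_visual) * (a / t_audio)
        fused.append(sigmoid(z))
    return fused


# Example: the visual decoder is assumed overconfident, so its logits are
# softened with a temperature of 2 before fusion.
print(fuse_scores([4.0, -1.0], [2.0, -3.0], t_visual=2.0))
```

With unit temperatures and equal weights, two zero logits fuse to a probability of 0.5. Raising one temperature discounts that modality's confidence without changing the sign of its evidence.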

If this is right

  • Detection average precision reaches 0.999 on the evaluated deepfake datasets.
  • Tamper localization becomes more reliable than existing audio-visual fusion baselines.
  • Performance holds under common compression codecs and moderate multimodal misalignment.
  • The method avoids the frequency-band overlap that causes earlier visual watermarks to fail under compression.
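
For readers unfamiliar with the headline metric, here is a short sketch of how average precision (AP) is commonly computed from ranked detection scores. The abstract does not state which AP variant the authors use, so the non-interpolated form below is an assumption, and `average_precision` is an illustrative helper rather than the paper's evaluation code.

```python
def average_precision(labels, scores):
    """Non-interpolated AP: mean precision at the rank of each positive.

    labels: 1 for tampered, 0 for authentic.
    scores: higher means "more likely tampered".
    Ties in score are broken by input order (no special tie handling).
    """
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive label")
    return precision_sum / hits


# One authentic clip is ranked above two tampered clips, which lowers AP.
labels = [1, 0, 1, 1]
scores = [0.9, 0.8, 0.7, 0.6]
print(average_precision(labels, scores))
```

An AP of 0.999 therefore means that almost every tampered sample is ranked above almost every authentic one across the whole score range, not merely at a single threshold.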

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Platforms could embed LAVA-style watermarks at upload time to create an authenticity layer for short-form video.
  • The same fusion idea might apply to other paired modalities such as video and text subtitles.
  • Adversaries might develop targeted attacks that simultaneously degrade both audio and visual watermark channels.
  • Widespread adoption would shift verification from reactive forensics to proactive content signing.

Load-bearing premise

Cross-modal watermark fusion and alignment can keep tamper signals consistent and detectable after real-world codec compression and audio-visual timing shifts.
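
How LAVA performs alignment is not described in the abstract. One simple way to picture the problem is to estimate the audio-visual offset by sliding one per-frame evidence sequence against the other and keeping the lag with the highest covariance. The sketch below does exactly that. `best_lag` and its inputs are illustrative assumptions, not the authors' procedure.

```python
def best_lag(visual, audio, max_lag):
    """Return the lag (in frames) at which audio[i + lag] best tracks visual[i].

    A positive result means the audio evidence arrives later than the visual
    evidence. Hypothetical helper, not the paper's alignment method.
    """
    best, best_cov = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        pairs = [
            (visual[i], audio[i + lag])
            for i in range(len(visual))
            if 0 <= i + lag < len(audio)
        ]
        if len(pairs) < 2:
            continue  # too little overlap to compare
        mean_v = sum(v for v, _ in pairs) / len(pairs)
        mean_a = sum(a for _, a in pairs) / len(pairs)
        cov = sum((v - mean_v) * (a - mean_a) for v, a in pairs) / len(pairs)
        if cov > best_cov:
            best, best_cov = lag, cov
    return best


visual = [0, 0, 1, 1, 0, 0, 0, 0]
audio = [0, 0, 0, 0, 1, 1, 0, 0]  # same pulse, two frames later
print(best_lag(visual, audio, max_lag=3))  # prints 2
```

At 25 frames per second, a lag of 2 frames corresponds to 80 ms. If a fused detector compared the two sequences without compensating for such a shift, evidence from the two modalities would point at different frames.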

What would settle it

Run LAVA on a test set of videos compressed with H.264 or VP9 at low bitrates and with added audio delays of 100-500 ms; if average precision falls substantially below 0.999, or localization IoU drops well below its value on uncompressed, synchronized video, the claim fails.
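
The stress test described above can be sketched as follows, assuming FFmpeg is used to produce the degraded copies. The helper names, the 200 kbit/s default, and the file names are hypothetical; the paper's actual protocol is not available here. The code only builds the command line and scores one predicted segment against ground truth, and it does not run LAVA itself.

```python
def degrade_cmd(src, dst, codec="libx264", bitrate_kbps=200, audio_delay_ms=300):
    """Build an FFmpeg command that re-encodes video at a low bitrate and
    delays the audio track by padding silence at its start.

    codec: "libx264" (H.264) or "libvpx-vp9" (VP9).
    The adelay option `all=1` needs a reasonably recent FFmpeg build.
    """
    audio_codec = {"libx264": "aac", "libvpx-vp9": "libopus"}[codec]
    return [
        "ffmpeg", "-y", "-i", src,
        "-c:v", codec, "-b:v", f"{bitrate_kbps}k",
        "-af", f"adelay=delays={audio_delay_ms}:all=1",
        "-c:a", audio_codec,
        dst,
    ]


def temporal_iou(pred, truth):
    """Intersection over union of two (start, end) segments in seconds."""
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0 else 0.0


# Sweep the delays named in the text for both codecs (commands are only printed).
for codec, ext in (("libx264", "mp4"), ("libvpx-vp9", "webm")):
    for delay in (100, 300, 500):
        print(" ".join(degrade_cmd("clip.mp4", f"clip_{delay}ms.{ext}",
                                   codec=codec, audio_delay_ms=delay)))

print(temporal_iou((1.0, 3.0), (2.0, 4.0)))
```

Passing each command list to `subprocess.run` would generate the degraded files. Detection AP and per-segment IoU can then be compared against the clean, synchronized baseline.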

Figures

Figures reproduced from arXiv: 2604.23957 by Bokang Zeng, Jiaojiao Jiang, Xiaoyan Feng, Xiaoyu Li, Zheng Gao.

Figure 1: Overview of the LAVA pipeline. Independent visual and audio integrity watermarks are embedded before distribution.
Figure 2: Spatiotemporal detection on LAV-DF. Top: per
Original abstract

Proactive watermarking offers a promising approach for deepfake tamper detection and localization in short-form videos. However, existing methods often decouple audio and visual evidence and assume that watermark signals remain reliable under real-world degradations, making tamper localization vulnerable to multimodal misalignment and compression distortions. Moreover, existing semi-fragile visual watermarking methods often degrade significantly under codec compression because their embedding bands overlap with compression-sensitive frequency regions. To address these limitations, we propose Layered Audio-Visual Anti-tampering Watermarking (LAVA), a calibration-aware audio-visual watermark fusion framework for deepfake tamper detection and localization. LAVA leverages cross-modal watermark fusion and calibration-aware alignment to preserve consistent and reliable tamper evidence under compression and audio-visual asynchrony, enabling robust tamper localization. Extensive experiments demonstrate that LAVA achieves near-perfect detection performance (AP = 0.999), remains robust to compression and multimodal misalignment, and significantly improves tamper localization reliability over existing audio-visual fusion baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes LAVA, a calibration-aware audio-visual watermark fusion framework for deepfake tamper detection and localization in short-form videos. It addresses limitations in prior methods that decouple audio and visual signals and are vulnerable to multimodal misalignment and codec compression by using cross-modal watermark fusion and calibration-aware alignment to preserve tamper evidence. The work reports near-perfect detection (AP = 0.999), robustness to compression and asynchrony, and improved localization over audio-visual fusion baselines.

Significance. If the reported results and robustness claims hold under full experimental scrutiny, this represents a meaningful advance in proactive watermarking for multimodal deepfake mitigation. The emphasis on handling real-world degradations like compression and asynchrony directly targets documented weaknesses in semi-fragile visual watermarking and could support more reliable tamper localization in practical video authenticity systems.

major comments (1)
  1. The provided manuscript consists solely of the abstract, with no access to methods, equations, ablation studies, experimental protocols, datasets, or baseline details. This prevents verification of whether the central claims (e.g., AP = 0.999 and robustness under compression/asynchrony) are supported by the evidence, making the soundness of the contribution impossible to assess at present.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review and for acknowledging the potential significance of LAVA in addressing real-world challenges in multimodal deepfake mitigation. We address the major comment point by point below.

  1. Referee: The provided manuscript consists solely of the abstract, with no access to methods, equations, ablation studies, experimental protocols, datasets, or baseline details. This prevents verification of whether the central claims (e.g., AP = 0.999 and robustness under compression/asynchrony) are supported by the evidence, making the soundness of the contribution impossible to assess at present.

    Authors: We apologize for the submission error that resulted in only the abstract being provided for review. The complete manuscript, available on arXiv:2604.23957, includes the full methods section detailing the cross-modal watermark fusion and calibration-aware alignment, supporting equations, ablation studies on robustness to compression and asynchrony, experimental protocols, datasets, and baseline comparisons. These elements directly substantiate the reported AP = 0.999 and robustness claims. We will revise the submission to include the full manuscript for proper assessment.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain


The paper proposes the LAVA framework for audio-visual watermarking in deepfake detection, describing cross-modal fusion and calibration-aware alignment at a high level. Claims of near-perfect detection (AP=0.999) and robustness are framed as outcomes of extensive experiments rather than mathematical derivations. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the provided text that would reduce any result to its inputs by construction. The approach addresses prior limitations through design choices validated empirically, remaining self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so no specific free parameters, axioms, or invented entities can be extracted or evaluated.

pith-pipeline@v0.9.0 · 5484 in / 1184 out tokens · 76126 ms · 2026-05-08T04:55:40.479683+00:00 · methodology

