LAVA: Layered Audio-Visual Anti-tampering Watermarking for Robust Deepfake Detection and Localization
Pith reviewed 2026-05-08 04:55 UTC · model grok-4.3
The pith
LAVA fuses watermarks across audio and visual layers to maintain deepfake detection and localization under compression and desynchronization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LAVA is a calibration-aware audio-visual watermark fusion framework that leverages cross-modal watermark fusion and calibration-aware alignment to preserve consistent and reliable tamper evidence under compression and audio-visual asynchrony, enabling robust tamper detection and localization.
What carries the argument
The layered audio-visual anti-tampering watermarking framework, which embeds and fuses signals in audio and visual layers with calibration to survive degradations.
If this is right
- Detection average precision reaches 0.999 on the evaluated deepfake datasets.
- Tamper localization becomes more reliable than existing audio-visual fusion baselines.
- Performance holds under common compression codecs and moderate multimodal misalignment.
- The method avoids the frequency-band overlap that causes earlier visual watermarks to fail under compression.
Where Pith is reading between the lines
- Platforms could embed LAVA-style watermarks at upload time to create an authenticity layer for short-form video.
- The same fusion idea might apply to other paired modalities such as video and text subtitles.
- Adversaries might develop targeted attacks that simultaneously degrade both audio and visual watermark channels.
- Widespread adoption would shift verification from reactive forensics to proactive content signing.
Load-bearing premise
Cross-modal watermark fusion and alignment can keep tamper signals consistent and detectable after real-world codec compression and audio-visual timing shifts.
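To make the premise concrete, here is a minimal Python sketch of one way calibrated cross-modal fusion with alignment could work. It is not LAVA's method: the abstract does not specify the mechanism, so the temperature-scaling calibration, the lag search by mean absolute difference, the weighted average, and every name (`calibrate`, `best_lag`, `fuse`, `t_audio`, `t_visual`, `w_audio`) are assumptions chosen for illustration.

```python
import math
from typing import List, Sequence


def calibrate(logits: Sequence[float], temperature: float) -> List[float]:
    """Map per-frame tamper logits to probabilities with temperature scaling (assumed)."""
    return [1.0 / (1.0 + math.exp(-z / temperature)) for z in logits]


def best_lag(audio: Sequence[float], visual: Sequence[float], max_lag: int) -> int:
    """Find the shift, in frames, that makes audio[i + lag] agree best with visual[i]."""
    def mismatch(lag: int) -> float:
        diffs = [
            abs(audio[i + lag] - visual[i])
            for i in range(len(visual))
            if 0 <= i + lag < len(audio)
        ]
        # A lag with no overlapping frames can never be the best choice.
        return sum(diffs) / len(diffs) if diffs else math.inf

    return min(range(-max_lag, max_lag + 1), key=mismatch)


def fuse(
    audio_logits: Sequence[float],
    visual_logits: Sequence[float],
    t_audio: float,
    t_visual: float,
    max_lag: int,
    w_audio: float = 0.5,
) -> List[float]:
    """Calibrate both streams, align audio to visual, then average the probabilities."""
    a = calibrate(audio_logits, t_audio)
    v = calibrate(visual_logits, t_visual)
    lag = best_lag(a, v, max_lag)
    fused = []
    for i, p_visual in enumerate(v):
        j = i + lag
        if 0 <= j < len(a):
            fused.append(w_audio * a[j] + (1.0 - w_audio) * p_visual)
        else:
            # No aligned audio frame exists here, so fall back to visual evidence.
            fused.append(p_visual)
    return fused


# Hypothetical example: the audio evidence trails the visual evidence by one frame.
visual_logits = [-4.0, -4.0, 4.0, 4.0, -4.0, -4.0]
audio_logits = [-4.0, -4.0, -4.0, 4.0, 4.0, -4.0]
fused_scores = fuse(audio_logits, visual_logits, 1.0, 1.0, max_lag=2)
```

In this toy setup the lag search recovers the one-frame offset, so the fused scores flag the same frames as the visual stream. Whether such agreement survives real codec compression is exactly what the premise leaves open.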
What would settle it
Run LAVA on a test set of videos compressed with H.264 or VP9 at low bitrates and with added audio delays of 100-500 ms; if average precision falls substantially below 0.999 or localization IoU drops, the claim fails.
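The two quantities this test depends on can be computed as sketched below. The sketch assumes non-interpolated average precision over per-video scores and a temporal IoU over (start, end) intervals in seconds. The paper's exact metric definitions are not given in the abstract, and the function names and sample values are hypothetical.

```python
from typing import Sequence, Tuple


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Non-interpolated AP: mean precision taken at the rank of each true positive."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def temporal_iou(pred: Tuple[float, float], truth: Tuple[float, float]) -> float:
    """Intersection over union of two time intervals given as (start, end)."""
    intersection = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - intersection
    return intersection / union if union > 0 else 0.0


# Hypothetical scores for four videos (1 = tampered, 0 = authentic).
ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])

# Hypothetical predicted and ground-truth tampered segments, in seconds.
iou = temporal_iou((1.0, 3.0), (2.0, 4.0))
```

Repeating these measurements for each codec, bitrate, and audio delay, then comparing against the uncompressed, synchronized baseline, would show whether the reported robustness holds.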
Original abstract
Proactive watermarking offers a promising approach for deepfake tamper detection and localization in short-form videos. However, existing methods often decouple audio and visual evidence and assume that watermark signals remain reliable under real-world degradations, making tamper localization vulnerable to multimodal misalignment and compression distortions. Moreover, existing semi-fragile visual watermarking methods often degrade significantly under codec compression because their embedding bands overlap with compression-sensitive frequency regions. To address these limitations, we propose Layered Audio-Visual Anti-tampering Watermarking (LAVA), a calibration-aware audio-visual watermark fusion framework for deepfake tamper detection and localization. LAVA leverages cross-modal watermark fusion and calibration-aware alignment to preserve consistent and reliable tamper evidence under compression and audio-visual asynchrony, enabling robust tamper localization. Extensive experiments demonstrate that LAVA achieves near-perfect detection performance (AP = 0.999), remains robust to compression and multimodal misalignment, and significantly improves tamper localization reliability over existing audio-visual fusion baselines.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes LAVA, a calibration-aware audio-visual watermark fusion framework for deepfake tamper detection and localization in short-form videos. It addresses limitations in prior methods that decouple audio and visual signals and are vulnerable to multimodal misalignment and codec compression by using cross-modal watermark fusion and calibration-aware alignment to preserve tamper evidence. The work reports near-perfect detection (AP = 0.999), robustness to compression and asynchrony, and improved localization over audio-visual fusion baselines.
Significance. If the reported results and robustness claims hold under full experimental scrutiny, this represents a meaningful advance in proactive watermarking for multimodal deepfake mitigation. The emphasis on handling real-world degradations like compression and asynchrony directly targets documented weaknesses in semi-fragile visual watermarking and could support more reliable tamper localization in practical video authenticity systems.
major comments (1)
- The provided manuscript consists solely of the abstract, with no access to methods, equations, ablation studies, experimental protocols, datasets, or baseline details. This prevents verification of whether the central claims (e.g., AP = 0.999 and robustness under compression/asynchrony) are supported by the evidence, making the soundness of the contribution impossible to assess at present.
Simulated Author's Rebuttal
We thank the referee for their review and for acknowledging the potential significance of LAVA in addressing real-world challenges in multimodal deepfake mitigation. We address the major comment point by point below.
- Referee: The provided manuscript consists solely of the abstract, with no access to methods, equations, ablation studies, experimental protocols, datasets, or baseline details. This prevents verification of whether the central claims (e.g., AP = 0.999 and robustness under compression/asynchrony) are supported by the evidence, making the soundness of the contribution impossible to assess at present.
Authors: We apologize for the submission error that resulted in only the abstract being provided for review. The complete manuscript, available on arXiv:2604.23957, includes the full methods section detailing the cross-modal watermark fusion and calibration-aware alignment, supporting equations, ablation studies on robustness to compression and asynchrony, experimental protocols, datasets, and baseline comparisons. These elements directly substantiate the reported AP = 0.999 and robustness claims. We will revise the submission to include the full manuscript for proper assessment.
Revision: yes
Circularity Check
No significant circularity in derivation chain
The paper proposes the LAVA framework for audio-visual watermarking in deepfake detection, describing cross-modal fusion and calibration-aware alignment at a high level. Claims of near-perfect detection (AP=0.999) and robustness are framed as outcomes of extensive experiments rather than mathematical derivations. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the provided text that would reduce any result to its inputs by construction. The approach addresses prior limitations through design choices validated empirically, remaining self-contained against external benchmarks.