arxiv: 2604.23957 · v1 · submitted 2026-04-27 · 💻 cs.CV

LAVA: Layered Audio-Visual Anti-tampering Watermarking for Robust Deepfake Detection and Localization

Bokang Zeng, Zheng Gao, Xiaoyu Li, Xiaoyan Feng, Jiaojiao Jiang

Pith reviewed 2026-05-08 04:55 UTC · model grok-4.3

classification 💻 cs.CV

keywords deepfake detection · watermarking · audio-visual fusion · tamper localization · robust watermarking · video compression · deepfake localization

The pith

LAVA fuses watermarks across audio and visual layers to maintain deepfake detection and localization under compression and desynchronization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes LAVA as a proactive watermarking system that embeds signals in both audio and visual tracks of short videos and then fuses them with calibration-aware alignment. This design aims to keep tamper evidence intact even when videos undergo codec compression or lose audio-visual timing sync, problems that break earlier decoupled or fragile watermark methods. A reader would care because deepfake videos now circulate at scale on social platforms where compression and editing are routine, so reliable preemptive detection could limit their spread. If the approach holds, it would let platforms or creators verify authenticity without waiting for post-hoc forensic analysis. The experiments claim this yields near-perfect detection scores while outperforming prior audio-visual baselines on localization tasks.

Core claim

LAVA is a calibration-aware audio-visual watermark fusion framework that leverages cross-modal watermark fusion and calibration-aware alignment to preserve consistent and reliable tamper evidence under compression and audio-visual asynchrony, enabling robust tamper detection and localization.

What carries the argument

The layered audio-visual anti-tampering watermarking framework, which embeds and fuses signals in audio and visual layers with calibration to survive degradations.
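The abstract does not say how the fusion or the alignment is computed, so the following is only a minimal sketch of one way a calibration-aware fusion could work, not the authors' method. It assumes each modality yields a per-frame tamper logit, uses temperature scaling as the calibration step, estimates the audio-visual lag with a brute-force correlation search, and averages the aligned probabilities. The function names (`calibrate`, `estimate_lag`, `fuse`), the temperatures, and the equal-weight average are all hypothetical.

```python
import numpy as np

def calibrate(logits, temperature):
    """Temperature scaling: divide logits by T, then apply a sigmoid."""
    z = np.asarray(logits, dtype=float) / temperature
    return 1.0 / (1.0 + np.exp(-z))

def estimate_lag(visual, audio, max_lag):
    """Return the lag (in frames) such that audio[t + lag] best matches visual[t]."""
    v = visual - visual.mean()
    a = audio - audio.mean()
    n = len(v)
    best_lag, best_score = 0, -np.inf
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            score = np.dot(v[: n - lag], a[lag:])
        else:
            score = np.dot(v[-lag:], a[: n + lag])
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def fuse(visual_logits, audio_logits, t_visual=1.0, t_audio=1.0, max_lag=5):
    """Calibrate both streams, undo the estimated lag, and average (assumed rule)."""
    p_v = calibrate(visual_logits, t_visual)
    p_a = calibrate(audio_logits, t_audio)
    lag = estimate_lag(p_v, p_a, max_lag)
    # aligned[t] = p_a[t + lag]; indices past the ends reuse the edge value.
    idx = np.clip(np.arange(len(p_a)) + lag, 0, len(p_a) - 1)
    return 0.5 * (p_v + p_a[idx]), lag
```

In this sketch, calibration puts both modalities on a comparable probability scale before they are combined, and the lag search stands in for whatever alignment the paper actually uses. A learned fusion weight or a per-segment lag would be natural variations.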

If this is right

Detection average precision reaches 0.999 on the evaluated deepfake datasets.
Tamper localization becomes more reliable than existing audio-visual fusion baselines.
Performance holds under common compression codecs and moderate multimodal misalignment.
The method avoids the frequency-band overlap that causes earlier visual watermarks to fail under compression.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Platforms could embed LAVA-style watermarks at upload time to create an authenticity layer for short-form video.
The same fusion idea might apply to other paired modalities such as video and text subtitles.
Adversaries might develop targeted attacks that simultaneously degrade both audio and visual watermark channels.
Widespread adoption would shift verification from reactive forensics to proactive content signing.

Load-bearing premise

Cross-modal watermark fusion and alignment can keep tamper signals consistent and detectable after real-world codec compression and audio-visual timing shifts.

What would settle it

Run LAVA on a test set of videos compressed with H.264 or VP9 at low bitrates and with added audio delays of 100-500 ms; if average precision falls substantially below 0.999, or localization IoU drops well below its value on clean, synchronized video, the claim fails.
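The check described above could be scripted roughly as follows. This is a sketch under stated assumptions: it supposes ffmpeg is the tool used to degrade the clips (the text names only the codecs and the delay range), and the helper names `degrade_cmd`, `average_precision`, and `temporal_iou` are hypothetical. The bitrate default is a placeholder. AP is computed here as the mean precision at each true-positive rank, which may differ from the paper's exact protocol.

```python
import numpy as np

def degrade_cmd(src, dst, codec="libx264", bitrate="200k", audio_delay_s=0.3):
    """Build (but do not run) an ffmpeg command that re-encodes the video
    at a low bitrate and delays the audio track by audio_delay_s seconds."""
    # Assumption: AAC audio with H.264, Opus otherwise (e.g. libvpx-vp9 in .webm).
    audio_codec = "aac" if codec == "libx264" else "libopus"
    return [
        "ffmpeg", "-y",
        "-i", src,                          # input 0: supplies the video stream
        "-itsoffset", str(audio_delay_s),   # shifts timestamps of the next input
        "-i", src,                          # input 1: supplies the delayed audio
        "-map", "0:v:0", "-map", "1:a:0",
        "-c:v", codec, "-b:v", bitrate,
        "-c:a", audio_codec,
        dst,
    ]

def average_precision(labels, scores):
    """Mean of the precision values taken at the rank of each positive."""
    labels = np.asarray(labels)
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    hits = labels[order]
    precision = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float((precision * hits).sum() / hits.sum())

def temporal_iou(pred, truth):
    """IoU of two (start, end) tampered intervals, in seconds."""
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0 else 0.0
```

Sweeping `bitrate` downward and `audio_delay_s` from 0.1 to 0.5 would trace how both metrics move as the clips degrade. The command list can be passed to `subprocess.run` once ffmpeg is available.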

Figures

Figures reproduced from arXiv: 2604.23957 by Bokang Zeng, Jiaojiao Jiang, Xiaoyan Feng, Xiaoyu Li, Zheng Gao.

**Figure 1.** Overview of the LAVA pipeline. Independent visual and audio integrity watermarks are embedded before distribution.

**Figure 2.** Spatiotemporal detection on LAV-DF. Top: per…

Original abstract

Proactive watermarking offers a promising approach for deepfake tamper detection and localization in short-form videos. However, existing methods often decouple audio and visual evidence and assume that watermark signals remain reliable under real-world degradations, making tamper localization vulnerable to multimodal misalignment and compression distortions. Moreover, existing semi-fragile visual watermarking methods often degrade significantly under codec compression because their embedding bands overlap with compression-sensitive frequency regions. To address these limitations, we propose Layered Audio-Visual Anti-tampering Watermarking (LAVA), a calibration-aware audio-visual watermark fusion framework for deepfake tamper detection and localization. LAVA leverages cross-modal watermark fusion and calibration-aware alignment to preserve consistent and reliable tamper evidence under compression and audio-visual asynchrony, enabling robust tamper localization. Extensive experiments demonstrate that LAVA achieves near-perfect detection performance (AP = 0.999), remains robust to compression and multimodal misalignment, and significantly improves tamper localization reliability over existing audio-visual fusion baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LAVA adds a calibration-aware audio-visual fusion layer to watermarking that targets compression and asynchrony, but the 0.999 AP claim rests on details we can't check from the abstract alone.

The core idea is straightforward: existing watermark methods either split audio and video or embed in bands that codecs destroy, so LAVA fuses the audio and visual watermarks with an alignment step that keeps the tamper signal consistent when the streams drift or get compressed. That addresses documented weak points in prior semi-fragile schemes without inventing new primitives from scratch. The abstract shows they ran experiments on short-form video and report strong detection plus better localization than audio-visual baselines, which is the practical part worth noticing if the numbers hold up under the same conditions others use.

What the work does cleanly is name the frequency-overlap and misalignment problems and propose a single framework to handle both instead of bolting fixes on separately. The cross-modal calibration step is the piece that feels new relative to decoupled approaches.

On the downside, the abstract gives no equations for the fusion or alignment, no ablation tables, and no list of exact baselines or datasets, so the 0.999 AP and robustness claims can't be stress-tested yet. High detection scores in this area often shrink once you vary codec settings or add real benign edits, and without those controls it's hard to know how much the calibration actually buys.

The paper is aimed at people building proactive detection tools for platforms that deal with compressed, misaligned clips. A reader who already knows the watermarking literature will see the incremental engineering value quickly. It deserves a serious referee because the motivation lines up with real deployment gaps and the high-level design is coherent, even if the current evidence is thin. I'd send it out for review rather than desk-reject.

Referee Report

1 major / 0 minor

Summary. The paper proposes LAVA, a calibration-aware audio-visual watermark fusion framework for deepfake tamper detection and localization in short-form videos. It addresses limitations in prior methods that decouple audio and visual signals and are vulnerable to multimodal misalignment and codec compression by using cross-modal watermark fusion and calibration-aware alignment to preserve tamper evidence. The work reports near-perfect detection (AP = 0.999), robustness to compression and asynchrony, and improved localization over audio-visual fusion baselines.

Significance. If the reported results and robustness claims hold under full experimental scrutiny, this represents a meaningful advance in proactive watermarking for multimodal deepfake mitigation. The emphasis on handling real-world degradations like compression and asynchrony directly targets documented weaknesses in semi-fragile visual watermarking and could support more reliable tamper localization in practical video authenticity systems.

major comments (1)

The provided manuscript consists solely of the abstract, with no access to methods, equations, ablation studies, experimental protocols, datasets, or baseline details. This prevents verification of whether the central claims (e.g., AP = 0.999 and robustness under compression/asynchrony) are supported by the evidence, making the soundness of the contribution impossible to assess at present.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review and for acknowledging the potential significance of LAVA in addressing real-world challenges in multimodal deepfake mitigation. We address the major comment point by point below.

Referee: The provided manuscript consists solely of the abstract, with no access to methods, equations, ablation studies, experimental protocols, datasets, or baseline details. This prevents verification of whether the central claims (e.g., AP = 0.999 and robustness under compression/asynchrony) are supported by the evidence, making the soundness of the contribution impossible to assess at present.

Authors: We apologize for the submission error that resulted in only the abstract being provided for review. The complete manuscript, available on arXiv:2604.23957, includes the full methods section detailing the cross-modal watermark fusion and calibration-aware alignment, supporting equations, ablation studies on robustness to compression and asynchrony, experimental protocols, datasets, and baseline comparisons. These elements directly substantiate the reported AP = 0.999 and robustness claims. We will revise the submission to include the full manuscript for proper assessment.

revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper proposes the LAVA framework for audio-visual watermarking in deepfake detection, describing cross-modal fusion and calibration-aware alignment at a high level. Claims of near-perfect detection (AP=0.999) and robustness are framed as outcomes of extensive experiments rather than mathematical derivations. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the provided text that would reduce any result to its inputs by construction. The approach addresses prior limitations through design choices validated empirically, remaining self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so no specific free parameters, axioms, or invented entities can be extracted or evaluated.
