Proactive Detection of Voice Cloning with Localized Watermarking

Alexandre Défossez; Hady Elsahar; Pierre Fernandez; Robin San Roman; Teddy Furon; Tuan Tran

arXiv: 2401.17264 · v2 · pith:7SKGRJSYnew · submitted 2024-01-30 · 💻 cs.SD · cs.AI · cs.CR

Robin San Roman, Pierre Fernandez, Alexandre Défossez, Teddy Furon, Tuan Tran, Hady Elsahar

classification 💻 cs.SD cs.AI cs.CR

keywords audioseal, detection, audio, localized, cloning, designed, detector, imperceptibility

In the rapidly evolving field of speech generative models, there is a pressing need to ensure audio authenticity against the risks of voice cloning. We present AudioSeal, the first audio watermarking technique designed specifically for localized detection of AI-generated speech. AudioSeal employs a generator/detector architecture trained jointly with a localization loss to enable localized watermark detection up to the sample level, and a novel perceptual loss inspired by auditory masking that enables AudioSeal to achieve better imperceptibility. AudioSeal achieves state-of-the-art performance in terms of robustness to real-life audio manipulations and imperceptibility based on automatic and human evaluation metrics. Additionally, AudioSeal is designed with a fast, single-pass detector that significantly surpasses existing models in speed, achieving detection up to two orders of magnitude faster, which makes it ideal for large-scale and real-time applications.
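
To make "localized detection up to the sample level" concrete, here is a minimal sketch of how per-sample detector scores could be turned into watermarked time ranges. It is not AudioSeal's API: the function name `localize_watermark`, the 0.5 threshold, the `min_len` filter, and the example scores are all assumptions made for illustration, and the abstract does not specify how the detector output is post-processed.

```python
from typing import List, Sequence, Tuple


def localize_watermark(
    probs: Sequence[float], threshold: float = 0.5, min_len: int = 1
) -> List[Tuple[int, int]]:
    """Group per-sample scores into half-open [start, end) sample ranges.

    `probs` is assumed to hold one watermark probability per audio sample,
    as a sample-level detector might produce. Runs of consecutive samples
    whose score exceeds `threshold` are reported if they span at least
    `min_len` samples.
    """
    segments: List[Tuple[int, int]] = []
    start = None  # index where the current above-threshold run began
    for i, p in enumerate(probs):
        if p > threshold:
            if start is None:
                start = i
        elif start is not None:
            # The run ended at sample i (exclusive); keep it if long enough.
            if i - start >= min_len:
                segments.append((start, i))
            start = None
    # Close a run that extends to the end of the signal.
    if start is not None and len(probs) - start >= min_len:
        segments.append((start, len(probs)))
    return segments


# Hypothetical per-sample scores, not real detector output.
scores = [0.1, 0.2, 0.9, 0.95, 0.8, 0.3, 0.1, 0.7, 0.9, 0.2]
segments = localize_watermark(scores, threshold=0.5, min_len=2)
print(segments)  # [(2, 5), (7, 9)]
```

Dividing each index by the sampling rate converts the ranges to seconds. The `min_len` filter is one simple way to suppress isolated spurious samples; it trades temporal resolution for fewer false alarms.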

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

The Watermark Shortcut: How Provenance Marking Sabotages Audio Deepfake Detection
cs.SD 2026-06 unverdicted novelty 8.0

Watermarking only synthetic audio leads deepfake detectors to use the watermark as a spurious shortcut, causing generalization failure, evasion by removing watermarks, and false positives on watermarked real audio.

MelShield: Robust Mel-Domain Audio Watermarking for Provenance Attribution of AI Generated Synthesized Speech
cs.SD 2026-05 unverdicted novelty 7.0

MelShield adds keyed low-energy spread-spectrum perturbations to Mel-spectrograms inside TTS pipelines before vocoding to enable robust extraction of user-specific attribution signals even after compression or noise.

Audio Pirates: Black-box Audio Watermark Removal via Diffusion Priors
cs.CR 2026-05 unverdicted novelty 6.0

DiffErase removes black-box audio watermarks via diffusion priors by adding intermediate noise and regenerating with a pretrained model, preserving quality across audio domains.

Who Gets Flagged? The Pluralistic Evaluation Gap in AI Content Watermarking
cs.CY 2026-04 conditional novelty 6.0

AI content watermarking exhibits detection disparities across languages, cultures, and demographics due to content-dependent signal properties, with benchmarks failing to disaggregate performance and watermarking held...