ASVspoof 2021: accelerating progress in spoofed and deepfake speech detection
ASVspoof 2021 is the fourth edition in the series of bi-annual challenges which aim to promote the study of spoofing and the design of countermeasures to protect automatic speaker verification systems from manipulation. In addition to a continued focus upon logical and physical access tasks, in which there are a number of advances compared to previous editions, ASVspoof 2021 introduces a new task involving deepfake speech detection. This paper describes all three tasks, the new databases for each of them, the evaluation metrics, four challenge baselines, the evaluation platform and a summary of challenge results. Despite the introduction of channel and compression variability, which compound the difficulty, results for the logical access and deepfake tasks are close to those from previous ASVspoof editions. Results for the physical access task show the difficulty in detecting attacks in real, variable physical spaces. ASVspoof 2021 is the first edition for which participants were not provided with any matched training or development data, reflecting real conditions in which the nature of spoofed and deepfake speech can never be predicted with confidence. Given this, the results are extremely encouraging and demonstrate the substantial progress made in the field in recent years.
Forward citations
Cited by 8 Pith papers
-
The Watermark Shortcut: How Provenance Marking Sabotages Audio Deepfake Detection
Watermarking only synthetic audio leads deepfake detectors to use the watermark as a spurious shortcut, causing generalization failure, evasion by removing watermarks, and false positives on watermarked real audio.
-
Linguistically Augmented Audio Speech Data (LinguAS)
Introduces the LinguAS dataset of genuine and deepfaked audio annotated with expert-defined linguistic features to improve detection model performance over ASVspoof 2021 and SSL baselines.
-
EchoFake: A Replay-Aware Dataset for Practical Speech Deepfake Detection
EchoFake is a new replay-aware dataset combining zero-shot TTS deepfakes and physical replay recordings to improve generalization of speech deepfake detection models over existing lab-focused datasets.
-
Alethia: A Foundational Encoder for Voice Deepfakes
Alethia is a pretrained audio encoder using continuous embedding prediction and generative flow-matching reconstruction that outperforms existing speech foundation models on voice deepfake tasks with better robustness...
-
MLAAD: The Multi-Language Audio Anti-Spoofing Dataset
MLAAD provides a large-scale multi-language synthetic audio dataset for training and evaluating audio anti-spoofing models, showing better training performance than InTheWild and FakeOrReal and alternating superiority...
-
Gender Fairness in Audio Deepfake Detection: Performance and Disparity Analysis
Fairness metrics uncover gender disparities in audio deepfake detection error distributions that standard Equal Error Rate metrics obscure.
-
Detecting Audio Deepfakes on the Edge: Lightweight SSL-Based Detection in a Browser Plugin
A truncated SSL backbone with a logistic classifier detects audio deepfakes on-device, claimed to outperform AASIST by 10% while running 40% faster, packaged as a browser plugin.
-
SpAArSIST: Sparsified AASIST for Efficient and Reliable Anti-Spoofing
SpAArSIST sparsifies AASIST by swapping learned pooling for explicit magnitude-based scoring and mean aggregation, cutting compute 20.7% and improving In-the-Wild EER to 2.82%.