XLS-R: Self-supervised cross-lingual speech representation learning at scale
7 Pith papers cite this work.
citing papers explorer
- MixFake: Benchmarking and Enhancing Audio Deepfake Detection in Diverse Real-world Mixed Audio
  MixFake is a new benchmark for mixed-authenticity audio; the accompanying multi-stream prompt tuning method achieves 0.95% EER on foreground deepfake detection and a 7.72% absolute gain on complex-background deepfake detection.
- Profiling the Voice: Speaker-Specific Phoneme Fingerprinting for Speech Deepfake Detection
  PVP models speaker-specific phoneme acoustic distributions with lightweight GMMs trained only on real speech to detect deepfakes of persons-of-interest, outperforming generic detectors and introducing a new Chinese POI dataset.
- Benchmarking Multilingual Speech Models on Pashto: Zero-Shot ASR, Script Failure, and Cross-Domain Evaluation
  Multilingual ASR models show 39.7-297% zero-shot WER on Pashto public data, Whisper models output the correct script in under 0.8% of cases, and fine-tuned models degrade to 32.5-59% WER on out-of-domain sets.
- A SUPERB-Style Benchmark of Self-Supervised Speech Models for Audio Deepfake Detection
  The Spoof-SUPERB benchmark shows that large-scale discriminative SSL models such as XLS-R, UniSpeech-SAT, and WavLM Large outperform others in audio deepfake detection and maintain robustness under acoustic degradations.
- Forensic Similarity for Speech Deepfakes
  Introduces forensic similarity for speech deepfakes via a Siamese feature extractor and similarity network to verify shared forensic traces and source models between audio segments.
- Similarity Choice and Negative Scaling in Supervised Contrastive Learning for Deepfake Audio Detection
  Cosine similarity in SupCon with a delayed negative queue on wav2vec2 XLS-R yields the lowest equal error rates for deepfake audio detection on in-the-wild and pooled evaluations.
- Giving Voice to the Constitution: Low-Resource Text-to-Speech for Quechua and Spanish Using a Bilingual Legal Corpus
  A bilingual TTS system for the Peruvian Constitution in Quechua and Spanish is developed with XTTS v2, F5-TTS, and DiFlow-TTS, releasing checkpoints and audio to support low-resource speech synthesis.