End-to-End Spectro-Temporal Graph Attention Networks for Speaker Verification Anti-Spoofing and Speech Deepfake Detection

Hemlata Tak; Jee-weon Jung; Jose Patino; Madhu Kamble; Massimiliano Todisco; Nicholas Evans

arxiv: 2107.12710 · v2 · pith:3KZKBTTWnew · submitted 2021-07-27 · 📡 eess.AS · cs.SD


classification 📡 eess.AS cs.SD

keywords graph detection fusion model speech temporal artefacts attention


Artefacts that serve to distinguish bona fide speech from spoofed or deepfake speech are known to reside in specific sub-bands and temporal segments. Various approaches can be used to capture and model such artefacts; however, none works well across a spectrum of diverse spoofing attacks. Reliable detection then often depends upon the fusion of multiple detection systems, each tuned to detect different forms of attack. In this paper we show that better performance can be achieved when the fusion is performed within the model itself and when the representation is learned automatically from raw waveform inputs. The principal contribution is a spectro-temporal graph attention network (GAT) which learns the relationship between cues spanning different sub-bands and temporal intervals. Using a model-level graph fusion of spectral (S) and temporal (T) sub-graphs and a graph pooling strategy to improve discrimination, the proposed RawGAT-ST model achieves an equal error rate of 1.06% for the ASVspoof 2019 logical access database. This is one of the best results reported to date and is reproducible using an open-source implementation.
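
The equal error rate (EER) quoted in the abstract is the operating point at which the false acceptance rate equals the false rejection rate. A minimal sketch of how such a figure could be estimated from detection scores is given here, assuming that higher scores indicate bona fide speech. The function name `equal_error_rate` and the toy scores are hypothetical, and this is neither the ASVspoof scoring tool nor the authors' evaluation code.

```python
import numpy as np


def equal_error_rate(bona_fide_scores, spoof_scores):
    """Estimate the EER and its threshold from two sets of detection scores.

    Assumption: a higher score means the detector considers the utterance
    more likely to be bona fide.
    """
    bona = np.asarray(bona_fide_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)

    # Candidate thresholds: every distinct observed score, in ascending order.
    thresholds = np.unique(np.concatenate([bona, spoof]))

    # False rejection rate: fraction of bona fide trials scored below the threshold.
    frr = np.array([(bona < t).mean() for t in thresholds])
    # False acceptance rate: fraction of spoofed trials scored at or above it.
    far = np.array([(spoof >= t).mean() for t in thresholds])

    # The EER lies where the two error curves cross; on finite data they may
    # not meet exactly, so take the closest point and average the two rates.
    idx = int(np.argmin(np.abs(far - frr)))
    eer = (far[idx] + frr[idx]) / 2.0
    return eer, thresholds[idx]


if __name__ == "__main__":
    # Toy scores for illustration only; not taken from the paper.
    eer, threshold = equal_error_rate([0.9, 0.8, 0.2, 0.7], [0.1, 0.3, 0.85, 0.4])
    print(f"EER = {100 * eer:.2f}% at threshold {threshold}")
```

On a small score set the crossing point is only approximate; evaluation toolkits typically interpolate between thresholds, which this sketch does not attempt.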


Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Profiling the Voice: Speaker-Specific Phoneme Fingerprinting for Speech Deepfake Detection
cs.SD 2026-05 unverdicted novelty 7.0

PVP models speaker-specific phoneme acoustic distributions with lightweight GMMs trained only on real speech to detect deepfakes of persons-of-interest, outperforming generic detectors and introducing a new Chinese PO...
The DeepSpeak Dataset
cs.CV 2024-08 unverdicted novelty 7.0

DeepSpeak provides over 100 hours of consented, identity-matched real and modern deepfake audiovisual content focused on talking heads, with evaluations showing existing detectors fail to generalize without retraining.
MLAAD: The Multi-Language Audio Anti-Spoofing Dataset
cs.SD 2024-01 unverdicted novelty 6.0

MLAAD provides a large-scale multi-language synthetic audio dataset for training and evaluating audio anti-spoofing models, showing better training performance than InTheWild and FakeOrReal and alternating superiority...
Detecting Audio Deepfakes on the Edge: Lightweight SSL-Based Detection in a Browser Plugin
eess.AS 2026-06 unverdicted novelty 4.0

Truncated SSL backbone with logistic classifier detects audio deepfakes on-device, claimed to outperform AASIST by 10% while running 40% faster, packaged as a browser plugin.