The DKU Replay Detection System for the ASVspoof 2019 Challenge: On Data Augmentation, Feature Representation, Classification, and Fusion
Pith reviewed 2026-05-25 02:11 UTC · model grok-4.3
The pith
A residual neural network trained on speed-perturbed group delay grams detects replay attacks at a 1.08% equal error rate on the ASVspoof 2019 evaluation set.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that an utterance-level residual neural network framework, when trained on speed-perturbed group delay gram features extracted from the phase spectrum, provides effective detection of physical access replay attacks, as shown by equal error rates of 1.04% on development data and 1.08% on evaluation data; simple score averaging across multiple such systems further reduces these rates to 0.24% and 0.66% respectively.
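Since every headline number here is an equal error rate (EER), a minimal sketch of how an EER can be estimated from two score lists may help. It assumes higher scores mean "more likely bona fide" and picks the threshold where false acceptance and false rejection are closest. The function name is hypothetical, and the official challenge scoring tool may interpolate differently, so small numerical differences are to be expected.

```python
import numpy as np

def equal_error_rate(bona_fide_scores, spoof_scores):
    """Estimate the EER, assuming higher scores mean 'more likely bona fide'."""
    bona = np.asarray(bona_fide_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)
    # Every observed score is a candidate decision threshold.
    thresholds = np.sort(np.concatenate([bona, spoof]))
    # False rejection rate: bona fide trials scored below the threshold.
    frr = np.array([(bona < t).mean() for t in thresholds])
    # False acceptance rate: spoofed trials scored at or above the threshold.
    far = np.array([(spoof >= t).mean() for t in thresholds])
    # The EER is where the two error rates cross; take the closest point.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0

eer_separable = equal_error_rate([2.0, 3.0, 4.0], [-1.0, 0.0, 1.0])
eer_overlap = equal_error_rate([0.9, 0.8, 0.3], [0.1, 0.2, 0.7])
```

An EER of 1.08% therefore means that, at the crossing threshold, roughly 1 in 93 spoofed trials is accepted and roughly 1 in 93 bona fide trials is rejected.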
What carries the argument
An utterance-level residual neural network receives variable-length feature sequences and outputs utterance-level scores directly.
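One way a network can accept variable-length input and still emit a single utterance-level vector is global average pooling (GAP) over the convolutional feature maps. The sketch below illustrates only that pooling step with NumPy. The channel count and map sizes are hypothetical, and the random arrays stand in for the output of convolutional layers that are not modelled here.

```python
import numpy as np

def global_average_pool(feature_maps):
    """Collapse (channels, freq, time) feature maps to a (channels,) vector."""
    return feature_maps.mean(axis=(1, 2))

rng = np.random.default_rng(0)
# Stand-ins for convolutional outputs of a short and a long utterance.
short_maps = rng.standard_normal((128, 8, 50))
long_maps = rng.standard_normal((128, 8, 300))

emb_short = global_average_pool(short_maps)
emb_long = global_average_pool(long_maps)
```

Both utterances end up with a vector of the same length, so the classification layers that follow never see the original duration.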
If this is right
- Speed perturbation applied to raw waveforms serves as effective data augmentation that improves model robustness for replay detection.
- Group delay gram features derived from the phase spectrum outperform magnitude-spectrum alternatives in this classification task.
- Simple averaging of output scores from multiple systems trained on different feature representations yields further error-rate reductions.
- The framework directly accepts variable-length inputs and produces utterance-level decisions without intermediate fixed-length segmentation.
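The speed perturbation named in the list above can be sketched as a simple resampling of the raw waveform. This is an illustration under assumptions: the factors 0.9 and 1.1 are not stated in this review, and linear interpolation is a crude stand-in for the band-limited resampling that audio tools normally use.

```python
import numpy as np

def speed_perturb(waveform, factor):
    """Resample so the signal plays `factor` times faster at the same rate."""
    waveform = np.asarray(waveform, dtype=float)
    n_out = int(round(len(waveform) / factor))
    # Positions in the original signal that the new samples read from.
    src_positions = np.arange(n_out) * factor
    return np.interp(src_positions, np.arange(len(waveform)), waveform)

one_second = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
faster = speed_perturb(one_second, 1.1)  # assumed factor, shorter output
slower = speed_perturb(one_second, 0.9)  # assumed factor, longer output
```

Because the duration changes along with the pitch, each perturbed copy behaves like a new utterance, which is why the classifier has to cope with variable-length input.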
Where Pith is reading between the lines
- The reported error rates suggest the pipeline could be paired with existing speaker verification systems to reduce successful replay attempts in deployed voice services.
- Phase-derived representations such as group delay grams may prove useful for detecting other forms of audio spoofing beyond physical-access replays.
- Testing the same architecture on replay data gathered under acoustic conditions or device sets outside the challenge would reveal whether performance depends on the specific training distribution.
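For readers unfamiliar with the group delay gram mentioned above, the sketch below shows one standard way to compute group delay per frame without phase unwrapping, using the identity tau = (X_R Y_R + X_I Y_I) / |X|^2, where X is the DFT of the frame x[n] and Y is the DFT of n * x[n]. Whether the authors compute it this way is not stated in this review. The frame length, hop, FFT size, and window are assumptions chosen for 16 kHz audio.

```python
import numpy as np

def group_delay_gram(waveform, frame_len=400, hop=160, n_fft=512, eps=1e-10):
    """Return a (n_fft // 2 + 1, n_frames) matrix of per-frame group delay."""
    waveform = np.asarray(waveform, dtype=float)
    window = np.hamming(frame_len)
    n = np.arange(frame_len)
    n_frames = 1 + (len(waveform) - frame_len) // hop
    gram = np.empty((n_fft // 2 + 1, n_frames))
    for i in range(n_frames):
        frame = waveform[i * hop : i * hop + frame_len] * window
        X = np.fft.rfft(frame, n_fft)       # spectrum of x[n]
        Y = np.fft.rfft(n * frame, n_fft)   # spectrum of n * x[n]
        # Negative phase derivative, computed without unwrapping the phase.
        gram[:, i] = (X.real * Y.real + X.imag * Y.imag) / (np.abs(X) ** 2 + eps)
    return gram

# Sanity check: an impulse delayed by 5 samples has group delay 5 everywhere.
impulse = np.zeros(400)
impulse[5] = 1.0
gd = group_delay_gram(impulse)
```

The result has the same time-frequency layout as a spectrogram, so it can be fed to the same convolutional front end as magnitude features.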
Load-bearing premise
The ASVspoof 2019 challenge data splits and evaluation protocol are assumed to be representative of real-world physical access replay attacks, with no post-hoc selection or overfitting to the specific test conditions.
What would settle it
Evaluating the trained system on a fresh collection of replay attacks recorded with devices, rooms, or playback equipment absent from the ASVspoof 2019 corpus would determine whether the reported equal error rates generalize.
Original abstract
This paper describes our DKU replay detection system for the ASVspoof 2019 challenge. The goal is to develop spoofing countermeasure for automatic speaker recognition in physical access scenario. We leverage the countermeasure system pipeline from four aspects, including the data augmentation, feature representation, classification, and fusion. First, we introduce an utterance-level deep learning framework for anti-spoofing. It receives the variable-length feature sequence and outputs the utterance-level scores directly. Based on the framework, we try out various kinds of input feature representations extracted from either the magnitude spectrum or phase spectrum. Besides, we also perform the data augmentation strategy by applying the speed perturbation on the raw waveform. Our best single system employs a residual neural network trained by the speed-perturbed group delay gram. It achieves EER of 1.04% on the development set, as well as EER of 1.08% on the evaluation set. Finally, using the simple average score from several single systems can further improve the performance. EER of 0.24% on the development set and 0.66% on the evaluation set is obtained for our primary system.
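The "simple average score" fusion described at the end of the abstract amounts to an unweighted mean over systems for each trial. A minimal sketch, assuming every system scores the same trials in the same order and on comparable scales; the function name and the example scores are hypothetical.

```python
import numpy as np

def average_fusion(score_lists):
    """Fuse per-system scores; rows are systems, columns are trials."""
    scores = np.asarray(score_lists, dtype=float)
    return scores.mean(axis=0)

# Three hypothetical single systems scoring the same four trials.
fused = average_fusion([
    [2.0, -1.0, 0.5, 3.0],
    [1.0, -2.0, 1.5, 2.0],
    [3.0, 0.0, 1.0, 4.0],
])
```

With no weights to tune, the fusion step itself cannot overfit to the development set, although the choice of which systems to include still can.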
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript describes the DKU replay detection system submitted to the ASVspoof 2019 challenge for physical-access spoofing countermeasures. It introduces an utterance-level deep neural network framework that processes variable-length feature sequences and outputs utterance-level scores, explores multiple input representations derived from magnitude or phase spectra, applies speed perturbation as data augmentation on the raw waveform, and reports results from a residual network trained on speed-perturbed group-delay grams together with simple score-level fusion of several single systems.
Significance. If the reported EER figures hold under the official protocol, the work supplies concrete evidence that phase-derived features combined with speed perturbation can yield single-system EERs below 1.1% on both development and evaluation partitions of the ASVspoof 2019 PA task, and that straightforward fusion can push performance still lower. Such empirical benchmarks are useful for the community even when the underlying architecture is standard.
major comments (2)
- [Abstract] The central performance claims (EER = 1.04% on the development set and 1.08% on the evaluation set for the best single system) are stated without any accompanying description of the training procedure, optimizer, learning-rate schedule, early-stopping criterion, or number of random seeds. Because the evaluation EER is obtained after tuning on the development set, these omissions make it impossible to judge whether the quoted figures are reproducible or the result of post-hoc selection.
- [Abstract] No ablation or comparative table is referenced that isolates the contribution of the group-delay gram versus other phase or magnitude features, nor the incremental gain from speed perturbation. Without such controlled comparisons the claim that the residual network “trained by the speed-perturbed group delay gram” is the decisive factor remains unsupported by the reported evidence.
minor comments (1)
- [Abstract] The abstract states that “several single systems” are fused by simple averaging but does not indicate how many systems, which feature sets, or whether the fusion weights were tuned on the development set.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our ASVspoof 2019 PA system description. We address the two major comments point-by-point below and will revise the manuscript to improve clarity and support for the reported claims.
- Referee: [Abstract] The central performance claims (EER = 1.04% on the development set and 1.08% on the evaluation set for the best single system) are stated without any accompanying description of the training procedure, optimizer, learning-rate schedule, early-stopping criterion, or number of random seeds. Because the evaluation EER is obtained after tuning on the development set, these omissions make it impossible to judge whether the quoted figures are reproducible or the result of post-hoc selection.
  Authors: We agree that the abstract lacks sufficient experimental-setup details to support reproducibility claims. In the revised version we will add a concise sentence summarizing the optimizer (Adam), initial learning rate, schedule, and early-stopping rule used for the ResNet, while keeping the abstract within length limits. Full hyper-parameter values and the number of random seeds are already stated in Section 3.2; the revision will simply make this information visible from the abstract.
  Revision: yes
- Referee: [Abstract] No ablation or comparative table is referenced that isolates the contribution of the group-delay gram versus other phase or magnitude features, nor the incremental gain from speed perturbation. Without such controlled comparisons the claim that the residual network “trained by the speed-perturbed group delay gram” is the decisive factor remains unsupported by the reported evidence.
  Authors: The manuscript already contains comparative EER tables for multiple magnitude- and phase-derived features (Table 2) and reports the effect of speed perturbation on the same architecture (Section 4.3). Nevertheless, we acknowledge that an explicit ablation isolating each factor is not referenced in the abstract. We will therefore insert a short ablation table (or reference an existing one more prominently) that shows incremental gains when adding speed perturbation and when switching to the group-delay gram; this will be added in the revised manuscript.
  Revision: yes
Circularity Check
No significant circularity; empirical results on external benchmark
The paper reports EER numbers obtained by training a ResNet on speed-perturbed group-delay features using the official ASVspoof 2019 PA training and development sets, then submitting scores for organizer-computed evaluation EER. No equations, derivations, or predictions are present that reduce by construction to fitted parameters or self-citations. The central claims are direct experimental outcomes on an externally defined challenge protocol with no load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The ASVspoof 2019 development and evaluation sets provide unbiased measures of replay detection performance.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "Our best single system employs a residual neural network trained by the speed-perturbed group delay gram. It achieves EER of 1.04% on the development set..."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "The network structure here is somewhat similar to that one in [19]. ... we train a single DNN from scratch only using the ASVspoof 2019 training set."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Introduction. Automatic speaker verification (ASV) refers to automatically accept or reject a claimed identity by analyzing speech utterances, and nowadays it is widely used in real-world biometric authentication applications [1–3]. Recently, a growing number of studies have confirmed the severe vulnerability of state-of-the-art ASV systems under a divers...
- [2] Methods. 2.1. Utterance-level DNN framework. All of our systems are built upon a unified utterance-level DNN framework. We first use the approach for the task of speaker and [arXiv:1907.02663v1 [eess.AS], 5 Jul 2019] [Figure 1 (labels: variable-length feature sequence, convolutional layers, GAP, utterance-level representation, bona fide, spoof): Utterance-level DNN framework fo...]
- [3] Experiments. 3.1. Data protocol and evaluation metrics. We strictly respect the official protocols defined in the evaluation plan and submit both development and evaluation set scores. We have 54 000 training utterances altogether, including 5400 bona fide audio and 48 600 spoofed audio from various replay conditions. The primary metric is the minimum normal...
- [4] Conclusion. In this paper, we take the DKU system for ASVspoof 2019 challenge as the pivot and explore the design of the countermeasure system against replay attack through data augmentation, feature representation, classification, and fusion. First, we utilize a ResNet back-end classifier based on the introduced utterance-level deep learning framework....
- [5] D. A. Reynolds and R. C. Rose, “Robust text-independent speaker identification using Gaussian mixture speaker models,” IEEE Transactions on Speech & Audio Processing, vol. 3, no. 1, pp. 72–83, 1995.
- [6] T. Kinnunen and H. Li, “An overview of text-independent speaker recognition: From features to supervectors,” Speech Communication, vol. 52, no. 1, pp. 12–40, 2010.
- [7] J. H. L. Hansen and T. Hasan, “Speaker recognition by machines and humans: A tutorial review,” IEEE Signal Processing Magazine, vol. 32, no. 6, pp. 74–99, 2015.
- [8] N. Evans, T. Kinnunen, and J. Yamagishi, “Spoofing and countermeasures for automatic speaker verification,” in Proc. INTERSPEECH 2013, 2013.
- [9] Z. Wu, N. Evans, T. Kinnunen, J. Yamagishi, F. Alegre, and H. Li, “Spoofing and countermeasures for speaker verification: A survey,” Speech Communication, vol. 66, pp. 130–153, 2015.
- [10] S. K. Ergunay, E. Khoury, A. Lazaridis, and S. Marcel, “On the vulnerability of speaker verification to realistic voice spoofing,” in IEEE International Conference on Biometrics Theory, Applications and Systems, 2015, pp. 1–6.
- [11] Z. Wu, J. Yamagishi, T. Kinnunen, C. Hanilci, M. Sahidullah, A. Sizov, N. Evans, M. Todisco, and H. Delgado, “ASVspoof: the automatic speaker verification spoofing and countermeasures challenge,” IEEE Journal of Selected Topics in Signal Processing, vol. PP, no. 99, pp. 1–1, 2017.
- [12] Z. Wu, T. Kinnunen, N. Evans, J. Yamagishi, C. Hanilçi, M. Sahidullah, and A. Sizov, “ASVspoof 2015: the first automatic speaker verification spoofing and countermeasures challenge,” in Proc. INTERSPEECH 2015, 2015.
- [13] T. Kinnunen, M. Sahidullah, M. Falcone, L. Costantini, R. G. Hautamäki, D. Thomsen, A. Sarkar, Z. Tan, H. Delgado, and M. Todisco, “RedDots replayed: A new replay spoofing attack corpus for text-dependent speaker verification research,” in Proc. ICASSP 2017, 2017.
- [14] T. Kinnunen, M. Sahidullah, H. Delgado, M. Todisco, N. Evans, J. Yamagishi, and K. A. Lee, “The ASVspoof 2017 challenge: Assessing the limits of replay spoofing attack detection,” in Proc. INTERSPEECH 2017, 2017, pp. 2–6.
- [15] H. Delgado, M. Todisco, M. Sahidullah, N. Evans, T. Kinnunen, K. A. Lee, and J. Yamagishi, “ASVspoof 2017 version 2.0: meta-data analysis and baseline enhancements,” in Proc. Speaker Odyssey 2018, 2018, pp. 296–303.
- [16]
- [17] C. Hanilçi, T. Kinnunen, M. Sahidullah, and A. Sizov, “Classifiers for synthetic speech detection: A comparison,” in Proc. INTERSPEECH 2015, 2015.
- [18] M. Todisco, H. Delgado, and N. Evans, “A new feature for automatic speaker verification anti-spoofing: Constant Q cepstral coefficients,” in Proc. Speaker Odyssey 2016, 2016.
- [19] J. C. Brown, “Calculation of a constant Q spectral transform,” Journal of the Acoustical Society of America, vol. 89, no. 1, pp. 425–434, 1991.
- [20] M. Todisco, H. Delgado, and N. Evans, “Constant Q cepstral coefficients: A spoofing countermeasure for automatic speaker verification,” Computer Speech & Language, vol. 45, pp. 516–535, 2017.
- [21] C. Zhang, C. Yu, and J. H. L. Hansen, “An investigation of deep-learning frameworks for speaker verification antispoofing,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 4, pp. 684–694, June 2017.
- [22] G. Lavrentyeva, S. Novoselov, E. Malykh, A. Kozlov, O. Kudashev, and V. Shchemelinin, “Audio replay attack detection with deep learning frameworks,” in Proc. INTERSPEECH 2017, 2017, pp. 82–86.
- [23] F. Tom, M. Jain, and P. Dey, “End-To-End Audio Replay Attack Detection Using Deep Convolutional Networks with Attention,” in Proc. INTERSPEECH 2018, 2018, pp. 681–685.
- [24] W. Cai, D. Cai, W. Liu, G. Li, and M. Li, “Countermeasures for automatic speaker verification replay spoofing attack: On data augmentation, feature representation, classification and fusion,” in Proc. INTERSPEECH 2017, 2017, pp. 17–21.
- [25] W. Cai, J. Chen, and M. Li, “Exploring the encoding layer and loss function in end-to-end speaker and language recognition system,” in Proc. Speaker Odyssey, 2018, pp. 74–81.
- [26] W. Cai, Z. Cai, W. Liu, X. Wang, and M. Li, “Insights into end-to-end learning scheme for language identification,” in Proc. ICASSP 2018, 2018, pp. 5209–5213.
- [27] W. Cai, J. Chen, and M. Li, “Analysis of length normalization in end-to-end speaker verification system,” in Proc. INTERSPEECH 2018, 2018, pp. 3618–3622.
- [28] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Proc. CVPR 2009, 2009, pp. 248–255.
- [29] M. Lin, Q. Chen, and S. Yan, “Network in network,” in Proc. ICLR 2014, 2014.
- [30] M. Sahidullah, T. Kinnunen, and C. Hanilçi, “A comparison of features for synthetic speech detection,” in Proc. INTERSPEECH 2015, 2015, pp. 2087–2091.
- [31] H. A. Murthy and V. Gadde, “The modified group delay function and its application to phoneme recognition,” in Proc. ICASSP 2003, vol. 1, 2003, pp. I–68.
- [32] R. M. Hegde, H. A. Murthy, and V. R. R. Gadde, “Significance of the Modified Group Delay Feature in Speech Recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 1, pp. 190–202.
- [33] I. Saratxaga, J. Sanchez, Z. Wu, I. Hernaez, and E. Navas, “Synthetic speech detection using phase information,” Speech Communication, vol. 81, pp. 30–41, 2016.
- [34] K. K. Paliwal and L. D. Alsteris, “On the usefulness of STFT phase spectrum in human listening tests,” Speech Communication, vol. 45, no. 2, pp. 153–170, 2005.
- [35] T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, “Audio augmentation for speech recognition,” in Proc. INTERSPEECH 2015, 2015, pp. 3586–3589.
- [36] T. Kinnunen, K. A. Lee, H. Delgado, N. Evans, M. Todisco, M. Sahidullah, J. Yamagishi, and D. A. Reynolds, “t-DCF: a detection cost function for the tandem assessment of spoofing countermeasures and automatic speaker verification,” in Proc. Speaker Odyssey 2018, 2018, pp. 312–319.
- [37] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. CVPR 2016, 2016, pp. 770–778.