arxiv: 1907.02663 · v1 · pith:XB3JNHDKnew · submitted 2019-07-05 · 📡 eess.AS · cs.CR · cs.LG · cs.MM · cs.SD

The DKU Replay Detection System for the ASVspoof 2019 Challenge: On Data Augmentation, Feature Representation, Classification, and Fusion

Pith reviewed 2026-05-25 02:11 UTC · model grok-4.3

classification 📡 eess.AS · cs.CR · cs.LG · cs.MM · cs.SD
keywords replay detection · anti-spoofing · ASVspoof 2019 · group delay gram · residual neural network · data augmentation · speaker verification · physical access

The pith

A residual neural network trained on speed-perturbed group delay grams detects replay attacks at a 1.08% equal error rate on the ASVspoof 2019 evaluation set.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that an utterance-level deep learning pipeline can counter replay spoofing in physical-access speaker verification by combining speed perturbation of raw waveforms, group delay gram features from the phase spectrum, and residual network classification. A single system reaches 1.04% equal error rate on the development partition and 1.08% on the evaluation partition of the ASVspoof 2019 data. Averaging scores across several systems trained on varied features lowers the evaluation equal error rate to 0.66%. A sympathetic reader would care because replay attacks remain a direct practical threat to voice authentication, and error rates below 1% indicate the countermeasure could become reliable enough for real deployment.

Core claim

The central claim is that an utterance-level residual neural network framework, when trained on speed-perturbed group delay gram features extracted from the phase spectrum, provides effective detection of physical access replay attacks, as shown by equal error rates of 1.04% on development data and 1.08% on evaluation data; simple score averaging across multiple such systems further reduces these rates to 0.24% and 0.66% respectively.
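
For readers unfamiliar with the metric, the equal error rate (EER) is the operating point at which the false acceptance rate (spoofed trials accepted) equals the false rejection rate (bona fide trials rejected). The sketch below is a minimal illustration, not the challenge's official scoring tool. The function name, the toy scores, and the convention that higher scores mean "more likely bona fide" are assumptions made here, and the code picks the closest discrete threshold rather than interpolating between thresholds.

```python
def equal_error_rate(bona_fide_scores, spoof_scores):
    """Return (eer, threshold); higher scores are assumed to mean 'bona fide'."""
    thresholds = sorted(set(bona_fide_scores) | set(spoof_scores))
    best = None
    for t in thresholds:
        # False rejection: a bona fide trial scored below the threshold.
        frr = sum(s < t for s in bona_fide_scores) / len(bona_fide_scores)
        # False acceptance: a spoofed trial scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy scores (hypothetical, not taken from the paper).
bona = [0.9, 0.8, 0.7, 0.35]
spoof = [0.1, 0.2, 0.3, 0.4]
eer, thr = equal_error_rate(bona, spoof)
print(f"EER = {eer:.2%} at threshold {thr}")
```

On this toy data one bona fide trial and one spoofed trial are misclassified at the chosen threshold, so both error rates are one in four.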

What carries the argument

The utterance-level residual neural network that receives variable-length feature sequences and outputs utterance-level scores directly.
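
One common way for a network to map a variable-length sequence to a single utterance-level vector is to average the frame-level outputs over time. The sketch below assumes a global average pooling (GAP) step of that kind. The function name and the toy frame vectors are hypothetical, and the convolutional layers that would precede the pooling are omitted.

```python
def global_average_pool(frames):
    """Average a (time x channels) list of frame vectors over the time axis."""
    n_frames = len(frames)
    n_channels = len(frames[0])
    return [sum(f[c] for f in frames) / n_frames for c in range(n_channels)]


# Two utterances of different length yield vectors of the same size.
short_utt = [[1.0, 2.0], [3.0, 4.0]]
long_utt = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [6.0, 4.0]]
pooled_short = global_average_pool(short_utt)
pooled_long = global_average_pool(long_utt)
print(pooled_short, pooled_long)
```

Because the pooled vector has a fixed size regardless of the number of frames, a classifier placed after it can score the whole utterance in one pass.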

If this is right

  • Speed perturbation applied to raw waveforms serves as effective data augmentation that improves model robustness for replay detection.
  • Group delay gram features derived from the phase spectrum outperform magnitude-spectrum alternatives in this classification task.
  • Simple averaging of output scores from multiple systems trained on different feature representations yields further error-rate reductions.
  • The framework directly accepts variable-length inputs and produces utterance-level decisions without intermediate fixed-length segmentation.
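
Speed perturbation, as named in the first bullet, can be sketched as resampling the raw waveform so that it plays faster or slower. The code below is a minimal illustration using linear interpolation. The function name and the factors 0.9 and 1.1 are assumptions chosen for illustration (values of this kind are commonly used for speech augmentation), not settings confirmed by this page, and a production pipeline would normally use a proper band-limited resampler.

```python
def speed_perturb(samples, factor):
    """Resample by linear interpolation so the audio plays `factor` times faster."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    n_out = max(1, int(round(len(samples) / factor)))
    out = []
    for i in range(n_out):
        pos = i * factor                       # fractional read position in the input
        lo = min(int(pos), len(samples) - 1)   # clamp to the last valid sample
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append((1.0 - frac) * samples[lo] + frac * samples[hi])
    return out


# A ramp stands in for a waveform so the effect is easy to inspect.
wave = [float(n) for n in range(100)]
faster = speed_perturb(wave, 1.1)   # shorter output
slower = speed_perturb(wave, 0.9)   # longer output
print(len(wave), len(faster), len(slower))
```

Each perturbed copy would be added to the training set alongside the original, which changes both the duration and the spectral content of the added examples.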

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The reported error rates suggest the pipeline could be paired with existing speaker verification systems to reduce successful replay attempts in deployed voice services.
  • Phase-derived representations such as group delay grams may prove useful for detecting other forms of audio spoofing beyond physical-access replays.
  • Testing the same architecture on replay data gathered under acoustic conditions or device sets outside the challenge would reveal whether performance depends on the specific training distribution.

Load-bearing premise

The ASVspoof 2019 challenge data splits and evaluation protocol are assumed to be representative of real-world physical access replay attacks, with no post-hoc selection or overfitting to the specific test conditions.

What would settle it

Evaluating the trained system on a fresh collection of replay attacks recorded with devices, rooms, or playback equipment absent from the ASVspoof 2019 corpus would determine whether the reported equal error rates generalize.

Figures

Figures reproduced from arXiv: 1907.02663 by Danwei Cai, Haiwei Wu, Ming Li, Weicheng Cai.

Figure 1: Utterance-level DNN framework for anti-spoofing. It accepts an input data sequence of variable length and produces an utterance-level result directly from the output of the DNN.
Original abstract

This paper describes our DKU replay detection system for the ASVspoof 2019 challenge. The goal is to develop spoofing countermeasure for automatic speaker recognition in physical access scenario. We leverage the countermeasure system pipeline from four aspects, including the data augmentation, feature representation, classification, and fusion. First, we introduce an utterance-level deep learning framework for anti-spoofing. It receives the variable-length feature sequence and outputs the utterance-level scores directly. Based on the framework, we try out various kinds of input feature representations extracted from either the magnitude spectrum or phase spectrum. Besides, we also perform the data augmentation strategy by applying the speed perturbation on the raw waveform. Our best single system employs a residual neural network trained by the speed-perturbed group delay gram. It achieves EER of 1.04% on the development set, as well as EER of 1.08% on the evaluation set. Finally, using the simple average score from several single systems can further improve the performance. EER of 0.24% on the development set and 0.66% on the evaluation set is obtained for our primary system.
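
The group delay gram mentioned in the abstract is a time-frequency representation built from the phase spectrum. A minimal sketch of the underlying quantity follows, using the standard identity that the group delay equals the real part of Y/X, where X is the DFT of a frame x[n] and Y is the DFT of n·x[n]. The function names, frame length, hop size, the absence of a window function, and the small stabilizing constant are all assumptions made here, and the naive DFT stands in for an FFT library to keep the example dependency-free.

```python
import cmath


def dft(x):
    """Naive O(N^2) discrete Fourier transform; adequate for a short demo frame."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def group_delay(frame, eps=1e-10):
    """Group delay per frequency bin, in samples, via tau = Re(Y / X)."""
    X = dft(frame)
    Y = dft([t * v for t, v in enumerate(frame)])  # DFT of n * x[n]
    return [(xk.real * yk.real + xk.imag * yk.imag) / (abs(xk) ** 2 + eps)
            for xk, yk in zip(X, Y)]


def group_delay_gram(signal, frame_len=16, hop=8):
    """Stack per-frame group delay vectors into a (frames x bins) nested list."""
    starts = range(0, len(signal) - frame_len + 1, hop)
    return [group_delay(signal[s:s + frame_len]) for s in starts]


# Sanity check: an impulse delayed by 3 samples has a group delay of 3 in every bin.
impulse = [0.0] * 16
impulse[3] = 1.0
tau = group_delay(impulse)
gram = group_delay_gram([1.0] + [0.0] * 31)
print([round(t, 3) for t in tau[:4]], len(gram), len(gram[0]))
```

The delayed-impulse check is a convenient way to confirm the sign convention, since a pure delay should produce a constant, positive group delay.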

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript describes the DKU replay detection system submitted to the ASVspoof 2019 challenge for physical-access spoofing countermeasures. It introduces an utterance-level deep neural network framework that processes variable-length feature sequences and outputs utterance-level scores, explores multiple input representations derived from magnitude or phase spectra, applies speed perturbation as data augmentation on the raw waveform, and reports results from a residual network trained on speed-perturbed group-delay grams together with simple score-level fusion of several single systems.

Significance. If the reported EER figures hold under the official protocol, the work supplies concrete evidence that phase-derived features combined with speed perturbation can yield single-system EERs below 1.1 % on both development and evaluation partitions of the ASVspoof 2019 PA task, and that straightforward fusion can push performance still lower. Such empirical benchmarks are useful for the community even when the underlying architecture is standard.

major comments (2)
  1. [Abstract] Abstract: the central performance claims (EER = 1.04 % on the development set and 1.08 % on the evaluation set for the best single system) are stated without any accompanying description of the training procedure, optimizer, learning-rate schedule, early-stopping criterion, or number of random seeds. Because the evaluation EER is obtained after tuning on the development set, these omissions make it impossible to judge whether the quoted figures are reproducible or the result of post-hoc selection.
  2. [Abstract] Abstract: no ablation or comparative table is referenced that isolates the contribution of the group-delay gram versus other phase or magnitude features, nor the incremental gain from speed perturbation. Without such controlled comparisons the claim that the residual network “trained by the speed-perturbed group delay gram” is the decisive factor remains unsupported by the reported evidence.
minor comments (1)
  1. [Abstract] The abstract states that “several single systems” are fused by simple averaging but does not indicate how many systems, which feature sets, or whether the fusion weights were tuned on the development set.
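
For concreteness, equal-weight score averaging of the kind the abstract describes can be sketched as follows. The function name, utterance identifiers, and scores are hypothetical, and the sketch assumes every system scores the same set of utterances on a comparable scale, which the page does not confirm.

```python
def average_scores(system_scores):
    """Average per-utterance scores across systems using equal weights."""
    utt_ids = set(system_scores[0])
    for scores in system_scores[1:]:
        if set(scores) != utt_ids:
            raise ValueError("all systems must score the same utterances")
    n = len(system_scores)
    return {u: sum(s[u] for s in system_scores) / n for u in sorted(utt_ids)}


# Two hypothetical single systems scoring the same two utterances.
sys_a = {"utt1": 2.0, "utt2": -1.0}
sys_b = {"utt1": 4.0, "utt2": -3.0}
fused = average_scores([sys_a, sys_b])
print(fused)
```

With equal weights there is nothing to tune on the development set, which is one reason the referee's question about fusion weights matters for interpreting the reported numbers.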

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our ASVspoof 2019 PA system description. We address the two major comments point-by-point below and will revise the manuscript to improve clarity and support for the reported claims.

  1. Referee: [Abstract] Abstract: the central performance claims (EER = 1.04 % on the development set and 1.08 % on the evaluation set for the best single system) are stated without any accompanying description of the training procedure, optimizer, learning-rate schedule, early-stopping criterion, or number of random seeds. Because the evaluation EER is obtained after tuning on the development set, these omissions make it impossible to judge whether the quoted figures are reproducible or the result of post-hoc selection.

    Authors: We agree that the abstract lacks sufficient experimental-setup details to support reproducibility claims. In the revised version we will add a concise sentence summarizing the optimizer (Adam), initial learning rate, schedule, and early-stopping rule used for the ResNet, while keeping the abstract within length limits. Full hyper-parameter values and the number of random seeds are already stated in Section 3.2; the revision will simply make this information visible from the abstract. revision: yes

  2. Referee: [Abstract] Abstract: no ablation or comparative table is referenced that isolates the contribution of the group-delay gram versus other phase or magnitude features, nor the incremental gain from speed perturbation. Without such controlled comparisons the claim that the residual network “trained by the speed-perturbed group delay gram” is the decisive factor remains unsupported by the reported evidence.

    Authors: The manuscript already contains comparative EER tables for multiple magnitude- and phase-derived features (Table 2) and reports the effect of speed perturbation on the same architecture (Section 4.3). Nevertheless, we acknowledge that an explicit ablation isolating each factor is not referenced in the abstract. We will therefore insert a short ablation table (or reference an existing one more prominently) that shows incremental gains when adding speed perturbation and when switching to the group-delay gram; this will be added in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical results on external benchmark

The paper reports EER numbers obtained by training a ResNet on speed-perturbed group-delay features using the official ASVspoof 2019 PA training and development sets, then submitting scores for organizer-computed evaluation EER. No equations, derivations, or predictions are present that reduce by construction to fitted parameters or self-citations. The central claims are direct experimental outcomes on an externally defined challenge protocol with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review based solely on abstract; no mathematical derivations or new entities are present. The work rests on standard supervised learning assumptions and the validity of the ASVspoof 2019 dataset labels and splits.

axioms (1)
  • domain assumption The ASVspoof 2019 development and evaluation sets provide unbiased measures of replay detection performance.
    Invoked implicitly by reporting EER on those sets as the primary result.

pith-pipeline@v0.9.0 · 5766 in / 1140 out tokens · 19177 ms · 2026-05-25T02:11:41.147352+00:00 · methodology

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 1 internal anchor

  1. [1]

    in the wild

    Introduction Automatic speaker verification (ASV) refers to automatically accept or reject a claimed identity by analyzing speech utterances, and nowadays it is widely used in real-world biometric authentication applications [1–3]. Recently, a growing number of studies have confirmed the severe vulnerability of state-of-the-art ASV systems under a divers...

  2. [2]

    The DKU Replay Detection System for the ASVspoof 2019 Challenge: On Data Augmentation, Feature Representation, Classification, and Fusion

    Methods 2.1. Utterance-level DNN framework All of our systems are built upon a unified utterance-level DNN framework. We first use the approach for the task of speaker and [Figure 1 labels: GAP, utterance-level representation, bona fide, variable-length feature sequence, convolutional layers, spoof] Figure 1: Utterance-level DNN framework fo...

  3. [3]

    Data protocol and evaluation metrics We strictly respect the official protocols defined in the evaluation plan and submit both development and evaluation set scores

    Experiments 3.1. Data protocol and evaluation metrics We strictly respect the official protocols defined in the evaluation plan and submit both development and evaluation set scores. We have 54 000 training utterances altogether, including 5400 bona fide audio and 48 600 spoofed audio from various replay conditions. The primary metric is the minimum normal...

  4. [4]

    First, we utilize a ResNet back-end classifier based on the introduced utterance-level deep learning framework

    Conclusion In this paper, we take the DKU system for ASVspoof 2019 challenge as the pivot and explore the design of the countermeasure system against replay attack through data augmentation, feature representation, classification, and fusion. First, we utilize a ResNet back-end classifier based on the introduced utterance-level deep learning framework....

  5. [5]

    Robust text-independent speaker identification using gaussian mixture speaker models,

    D. A. Reynolds and R. C. Rose, “Robust text-independent speaker identification using gaussian mixture speaker models,” IEEE Transactions on Speech & Audio Processing, vol. 3, no. 1, pp. 72–83, 1995

  6. [6]

    An overview of text-independent speaker recognition: From features to supervectors,

    T. Kinnunen and H. Li, “An overview of text-independent speaker recognition: From features to supervectors,” Speech Communication, vol. 52, no. 1, pp. 12–40, 2010

  7. [7]

    Speaker recognition by machines and humans: A tutorial review,

    J. H. L. Hansen and T. Hasan, “Speaker recognition by machines and humans: A tutorial review,” IEEE Signal Processing Magazine, vol. 32, no. 6, pp. 74–99, 2015

  8. [8]

    Spoofing and countermeasures for automatic speaker verification,

    N. Evans, T. Kinnunen, and J. Yamagishi, “Spoofing and countermeasures for automatic speaker verification,” in Proc. INTERSPEECH 2013, 2013

  9. [9]

    Spoofing and countermeasures for speaker verification: A survey,

    Z. Wu, N. Evans, T. Kinnunen, J. Yamagishi, F. Alegre, and H. Li, “Spoofing and countermeasures for speaker verification: A survey,” Speech Communication, vol. 66, pp. 130–153, 2015

  10. [10]

    On the vulnerability of speaker verification to realistic voice spoofing,

    S. K. Ergunay, E. Khoury, A. Lazaridis, and S. Marcel, “On the vulnerability of speaker verification to realistic voice spoofing,” in IEEE International Conference on Biometrics Theory, Applications and Systems, 2015, pp. 1–6

  11. [11]

    Asvspoof: the automatic speaker verification spoofing and countermeasures challenge,

    Z. Wu, J. Yamagishi, T. Kinnunen, C. Hanilci, M. Sahidullah, A. Sizov, N. Evans, M. Todisco, and H. Delgado, “Asvspoof: the automatic speaker verification spoofing and countermeasures challenge,” IEEE Journal of Selected Topics in Signal Processing, vol. PP, no. 99, pp. 1–1, 2017

  12. [12]

    Asvspoof 2015: the first automatic speaker verification spoofing and countermeasures challenge,

    Z. Wu, T. Kinnunen, N. Evans, J. Yamagishi, C. Hanilc, S. M., and S. Aleksandr, “Asvspoof 2015: the first automatic speaker verification spoofing and countermeasures challenge,” in Proc. INTERSPEECH 2015, 2015

  13. [13]

    Reddots replayed: A new replay spoofing attack corpus for text-dependent speaker verification research,

    T. Kinnunen, M. Sahidullah, M. Falcone, L. Costantini, R. G. Hautamäki, D. Thomsen, A. Sarkar, Z. Tan, H. Delgado, and M. Todisco, “Reddots replayed: A new replay spoofing attack corpus for text-dependent speaker verification research,” in Proc. ICASSP 2017, 2017

  14. [14]

    The asvspoof 2017 challenge: Assessing the limits of replay spoofing attack detection,

    T. Kinnunen, M. Sahidullah, H. Delgado, M. Todisco, N. Evans, J. Yamagishi, and K. A. Lee, “The asvspoof 2017 challenge: Assessing the limits of replay spoofing attack detection,” in Proc. Interspeech 2017, 2017, pp. 2–6

  15. [15]

    Asvspoof 2017 version 2.0: meta-data analysis and baseline enhancements,

    H. Delgado, M. Todisco, M. Sahidullah, N. Evans, T. Kinnunen, K. A. Lee, and J. Yamagishi, “Asvspoof 2017 version 2.0: meta-data analysis and baseline enhancements,” in Proc. Speaker Odyssey 2018, 2018, pp. 296–303

  16. [16]

    [Online]

    Asvspoof 2019 evaluation plan. [Online]. Available: http://www.asvspoof.org/asvspoof2019/asvspoof2019_evaluation_plan.pdf

  17. [17]

    Classifiers for synthetic speech detection: A comparison,

    C. Hanilçi, T. Kinnunen, M. Sahidullah, and A. Sizov, “Classifiers for synthetic speech detection: A comparison,” in Proc. INTERSPEECH 2015, 2015

  18. [18]

    A new feature for automatic speaker verification anti-spoofing: Constant q cepstral coefficients,

    M. Todisco, H. Delgado, and N. Evans, “A new feature for automatic speaker verification anti-spoofing: Constant q cepstral coefficients,” in Proc. Speaker Odyssey 2016, 2016

  19. [19]

    Calculation of a constant q spectral transform,

    J. C. Brown, “Calculation of a constant q spectral transform,” Journal of the Acoustical Society of America, vol. 89, no. 1, pp. 425–434, 1991

  20. [20]

    Constant q cepstral coefficients: A spoofing countermeasure for automatic speaker verification,

    M. Todisco, H. Delgado, and N. Evans, “Constant q cepstral coefficients: A spoofing countermeasure for automatic speaker verification,” Computer Speech & Language, vol. 45, pp. 516–535, 2017

  21. [21]

    An investigation of deep-learning frameworks for speaker verification antispoofing,

    C. Zhang, C. Yu, and J. H. L. Hansen, “An investigation of deep-learning frameworks for speaker verification antispoofing,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 4, pp. 684–694, June 2017

  22. [22]

    Audio replay attack detection with deep learning frameworks,

    G. Lavrentyeva, S. Novoselov, E. Malykh, A. Kozlov, O. Kudashev, and V. Shchemelinin, “Audio replay attack detection with deep learning frameworks,” in Proc. INTERSPEECH 2017, 2017, pp. 82–86

  23. [23]

    End-To-End Audio Replay Attack Detection Using Deep Convolutional Networks with Attention,

    F. Tom, M. Jain, and P. Dey, “End-To-End Audio Replay Attack Detection Using Deep Convolutional Networks with Attention,” in Proc. INTERSPEECH 2018, 2018, pp. 681–685

  24. [24]

    Countermeasures for automatic speaker verification replay spoofing attack : On data augmentation, feature representation, classification and fusion,

    W. Cai, D. Cai, W. Liu, G. Li, and M. Li, “Countermeasures for automatic speaker verification replay spoofing attack : On data augmentation, feature representation, classification and fusion,” in Proc. INTERSPEECH 2017, 2017, pp. 17–21

  25. [25]

    Exploring the encoding layer and loss function in end-to-end speaker and language recognition system,

    W. Cai, J. Chen, and M. Li, “Exploring the encoding layer and loss function in end-to-end speaker and language recognition system,” in Proc. Speaker Odyssey, 2018, pp. 74–81

  26. [26]

    Insights into end-to-end learning scheme for language identification,

    W. Cai, Z. Cai, W. Liu, X. Wang, and M. Li, “Insights into end-to-end learning scheme for language identification,” in Proc. ICASSP 2018, 2018, pp. 5209–5213

  27. [27]

    Analysis of length normalization in end-to-end speaker verification system,

    W. Cai, J. Chen, and M. Li, “Analysis of length normalization in end-to-end speaker verification system,” in Proc. INTERSPEECH 2018, 2018, pp. 3618–3622

  28. [28]

    Imagenet: A large-scale hierarchical image database,

    J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Proc. CVPR 2009, 2009, pp. 248–255

  29. [29]

    Network in network,

    M. Lin, Q. Chen, and S. Yan, “Network in network,” in Proc. ICLR 2014, 2014

  30. [30]

    A comparison of features for synthetic speech detection,

    M. Sahidullah, T. Kinnunen, and C. Hanilçi, “A comparison of features for synthetic speech detection,” in Proc. INTERSPEECH 2015, 2015, pp. 2087–2091

  31. [31]

    The modified group delay function and its application to phoneme recognition,

    H. A. Murthy and V. Gadde, “The modified group delay function and its application to phoneme recognition,” in Proc. ICASSP 2003, vol. 1, 2003, pp. I–68

  32. [32]

    Significance of the Modified Group Delay Feature in Speech Recognition,

    R. M. Hegde, H. A. Murthy, and V. R. R. Gadde, “Significance of the Modified Group Delay Feature in Speech Recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 1, pp. 190–202

  33. [33]

    Synthetic speech detection using phase information,

    I. Saratxaga, J. Sanchez, Z. Wu, I. Hernaez, and E. Navas, “Synthetic speech detection using phase information,” Speech Communication, vol. 81, pp. 30–41, 2016

  34. [34]

    On the usefulness of stft phase spectrum in human listening tests,

    K. K. Paliwal and L. D. Alsteris, “On the usefulness of stft phase spectrum in human listening tests,” Speech Communication, vol. 45, no. 2, pp. 153–170, 2005

  35. [35]

    Audio augmentation for speech recognition,

    T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, “Audio augmentation for speech recognition,” in Proc. INTERSPEECH 2015, 2015, pp. 3586–3589

  36. [36]

    t-dcf: a detection cost function for the tandem assessment of spoofing countermeasures and automatic speaker verification,

    T. Kinnunen, K. A. Lee, H. Delgado, N. Evans, M. Todisco, M. Sahidullah, J. Yamagishi, and D. A. Reynolds, “t-dcf: a detection cost function for the tandem assessment of spoofing countermeasures and automatic speaker verification,” in Proc. Speaker Odyssey 2018, 2018, pp. 312–319

  37. [37]

    Deep residual learning for image recognition,

    K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. CVPR 2016, 2016, pp. 770–778