The DKU Replay Detection System for the ASVspoof 2019 Challenge: On Data Augmentation, Feature Representation, Classification, and Fusion

Danwei Cai; Haiwei Wu; Ming Li; Weicheng Cai

arXiv: 1907.02663 · v1 · pith:XB3JNHDKnew · submitted 2019-07-05 · eess.AS · cs.CR · cs.LG · cs.MM · cs.SD

Pith reviewed 2026-05-25 02:11 UTC · model grok-4.3

classification eess.AS · cs.CR · cs.LG · cs.MM · cs.SD

keywords replay detection · anti-spoofing · ASVspoof 2019 · group delay gram · residual neural network · data augmentation · speaker verification · physical access

The pith

A residual neural network trained on speed-perturbed group delay grams detects replay attacks at a 1.08% equal error rate on the ASVspoof 2019 evaluation set.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that an utterance-level deep learning pipeline can counter replay spoofing in physical-access speaker verification by combining speed perturbation of raw waveforms, group delay gram features from the phase spectrum, and residual network classification. A single system reaches 1.04% equal error rate on the development partition and 1.08% on the evaluation partition of the ASVspoof 2019 data. Averaging scores across several systems trained on varied features lowers the evaluation equal error rate to 0.66%. A sympathetic reader would care because replay attacks remain a direct practical threat to voice authentication, and a fused error rate below 1% on the evaluation partition suggests the countermeasure could become reliable enough for real deployment.

Core claim

The central claim is that an utterance-level residual neural network framework, when trained on speed-perturbed group delay gram features extracted from the phase spectrum, provides effective detection of physical access replay attacks, as shown by equal error rates of 1.04% on development data and 1.08% on evaluation data; simple score averaging across multiple such systems further reduces these rates to 0.24% and 0.66% respectively.
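Every figure in the claim above is an equal error rate (EER): the operating point where the false acceptance rate on spoofed trials equals the false rejection rate on bona fide trials. The following is a minimal, dependency-free sketch of that metric, not the organizers' scoring tool. The function name `equal_error_rate` and the convention that higher scores mean "bona fide" are assumptions made for illustration.

```python
def equal_error_rate(bona_fide_scores, spoof_scores):
    """Approximate the EER as a fraction, assuming higher scores mean bona fide."""
    thresholds = sorted(set(bona_fide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False rejection: bona fide trials scoring below the threshold.
        frr = sum(s < t for s in bona_fide_scores) / len(bona_fide_scores)
        # False acceptance: spoofed trials scoring at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        # Keep the threshold where the two error rates are closest.
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```

With finite score lists the two error curves rarely cross exactly, so the sketch reports the midpoint at the closest threshold. Official tools may interpolate differently.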

What carries the argument

The utterance-level residual neural network that receives variable-length feature sequences and outputs utterance-level scores directly.
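The block labelled "GAP" in the paper's Figure 1 (global average pooling) is what lets one network score utterances of any length: averaging each channel over all positions always yields a vector with one value per channel. The sketch below illustrates only that pooling step. Nested lists stand in for convolutional output, and the name `global_average_pool` is hypothetical rather than taken from the authors' code.

```python
def global_average_pool(feature_maps):
    """Average each channel over every time-frequency position.

    feature_maps: list of C channels, each a list of frequency rows,
    each row a list of values over time. Returns a length-C vector,
    independent of how many time steps the utterance has.
    """
    pooled = []
    for channel in feature_maps:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled
```

A two-frame input and a five-frame input with the same number of channels both pool to vectors of the same size, so a fixed-size classifier layer can follow.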

If this is right

Speed perturbation applied to raw waveforms serves as effective data augmentation that improves model robustness for replay detection.
Group delay gram features derived from the phase spectrum outperform magnitude-spectrum alternatives in this classification task.
Simple averaging of output scores from multiple systems trained on different feature representations yields further error-rate reductions.
The framework directly accepts variable-length inputs and produces utterance-level decisions without intermediate fixed-length segmentation.
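Speed perturbation, as named in the first point above, can be sketched as resampling the waveform and keeping the original sample rate, which changes both duration and pitch. This is a rough illustration using linear interpolation. The abstract does not state which factors or resampler the authors used, so the function `speed_perturb` and the factors 0.9 and 1.1 (values common in speech recognition augmentation) are assumptions.

```python
def speed_perturb(waveform, factor):
    """Resample so the signal plays `factor` times faster at the same sample rate."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    if not waveform:
        return []
    n_out = max(1, int(round(len(waveform) / factor)))
    out = []
    for i in range(n_out):
        pos = i * factor                        # fractional position in the input
        lo = min(int(pos), len(waveform) - 1)
        hi = min(lo + 1, len(waveform) - 1)
        frac = pos - lo
        # Linear interpolation between the two neighbouring input samples.
        out.append((1.0 - frac) * waveform[lo] + frac * waveform[hi])
    return out

# Assumed augmentation set: the original plus a slower and a faster copy.
ramp = [float(i) for i in range(100)]
augmented = [speed_perturb(ramp, f) for f in (0.9, 1.0, 1.1)]
```

Linear interpolation keeps the sketch short. A production pipeline would more likely use a band-limited resampler to avoid aliasing.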

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The reported error rates suggest the pipeline could be paired with existing speaker verification systems to reduce successful replay attempts in deployed voice services.
Phase-derived representations such as group delay grams may prove useful for detecting other forms of audio spoofing beyond physical-access replays.
Testing the same architecture on replay data gathered under acoustic conditions or device sets outside the challenge would reveal whether performance depends on the specific training distribution.

Load-bearing premise

The ASVspoof 2019 challenge data splits and evaluation protocol are assumed to be representative of real-world physical access replay attacks, with no post-hoc selection or overfitting to the specific test conditions.

What would settle it

Evaluating the trained system on a fresh collection of replay attacks recorded with devices, rooms, or playback equipment absent from the ASVspoof 2019 corpus would determine whether the reported equal error rates generalize.

Figures

Figures reproduced from arXiv: 1907.02663 by Danwei Cai, Haiwei Wu, Ming Li, Weicheng Cai.

**Figure 1.** Utterance-level DNN framework for anti-spoofing. It accepts input data sequence with variable length, and produces an utterance-level result directly from the output of the DNN.

Original abstract

This paper describes our DKU replay detection system for the ASVspoof 2019 challenge. The goal is to develop spoofing countermeasure for automatic speaker recognition in physical access scenario. We leverage the countermeasure system pipeline from four aspects, including the data augmentation, feature representation, classification, and fusion. First, we introduce an utterance-level deep learning framework for anti-spoofing. It receives the variable-length feature sequence and outputs the utterance-level scores directly. Based on the framework, we try out various kinds of input feature representations extracted from either the magnitude spectrum or phase spectrum. Besides, we also perform the data augmentation strategy by applying the speed perturbation on the raw waveform. Our best single system employs a residual neural network trained by the speed-perturbed group delay gram. It achieves EER of 1.04% on the development set, as well as EER of 1.08% on the evaluation set. Finally, using the simple average score from several single systems can further improve the performance. EER of 0.24% on the development set and 0.66% on the evaluation set is obtained for our primary system.
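The group delay gram named in the abstract is a phase-derived counterpart of the spectrogram: each frame is described by its group delay, the negative derivative of phase with respect to frequency. A standard identity avoids phase unwrapping: with X the DFT of a frame x[n] and Y the DFT of n·x[n], the group delay is (Re X · Re Y + Im X · Im Y) / |X|². The sketch below applies that identity frame by frame. The frame length, hop, Hann window and function names are assumptions, and the paper may use a different variant.

```python
import cmath
import math

def dft(frame):
    """Naive O(N^2) discrete Fourier transform, adequate for a short demo."""
    n = len(frame)
    return [sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))
            for k in range(n)]

def group_delay(frame, eps=1e-10):
    """Group delay per DFT bin, in samples, computed without phase unwrapping."""
    spectrum = dft(frame)
    ramped = dft([i * x for i, x in enumerate(frame)])  # DFT of n * x[n]
    return [(X.real * Y.real + X.imag * Y.imag) / max(abs(X) ** 2, eps)
            for X, Y in zip(spectrum, ramped)]

def group_delay_gram(waveform, frame_len=64, hop=32):
    """Stack per-frame group delay vectors into a time-by-frequency matrix."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / (frame_len - 1))
              for i in range(frame_len)]
    gram = []
    for start in range(0, len(waveform) - frame_len + 1, hop):
        frame = [w * x for w, x in zip(window, waveform[start:start + frame_len])]
        gram.append(group_delay(frame)[: frame_len // 2 + 1])  # non-negative bins
    return gram
```

As a sanity check, a unit impulse delayed by three samples has a group delay of three samples in every bin. The `eps` floor guards against division by near-zero magnitudes, which is where raw group delay is known to spike.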

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Solid challenge entry with strong EERs on ASVspoof 2019 PA using a ResNet on speed-perturbed group delay features, but it applies known tools without introducing a new framework.

This paper reports the DKU system's results for the ASVspoof 2019 physical access replay detection task. Their best single system uses a residual network on speed-perturbed group delay gram features and hits 1.04% EER on dev and 1.08% on eval. Fusion of several systems drops the eval EER to 0.66%. They test multiple magnitude and phase-based features inside an utterance-level deep classifier and add speed perturbation on the waveform for augmentation. The work follows the official challenge splits and protocol exactly.

What it does well is deliver competitive numbers through careful feature and augmentation choices on a standard benchmark. The pipeline is described clearly enough in the abstract to show they tried several representations and combined them simply. The central performance claim is grounded in the external organizer-scored evaluation, which reduces circularity risk.

Soft spots are the absence of training details, validation splits, error bars, or ablation tables in the provided abstract; the full paper would need to show those to confirm the numbers aren't from heavy post-hoc tuning on dev. Novelty is limited to applying established ResNet and perturbation ideas to this specific task rather than introducing a new method or derivation. The data is assumed to represent real replay attacks, which is the usual challenge framing but not proven here.

This paper is for teams tracking anti-spoofing progress on public benchmarks. A reader working on voice biometrics or challenge systems gets practical value from the feature comparisons and fusion results. It deserves serious peer review because the results are sharp on an established protocol and the approach is reproducible in principle.

Referee Report

2 major / 1 minor

Summary. The manuscript describes the DKU replay detection system submitted to the ASVspoof 2019 challenge for physical-access spoofing countermeasures. It introduces an utterance-level deep neural network framework that processes variable-length feature sequences and outputs utterance-level scores, explores multiple input representations derived from magnitude or phase spectra, applies speed perturbation as data augmentation on the raw waveform, and reports results from a residual network trained on speed-perturbed group-delay grams together with simple score-level fusion of several single systems.

Significance. If the reported EER figures hold under the official protocol, the work supplies concrete evidence that phase-derived features combined with speed perturbation can yield single-system EERs below 1.1 % on both development and evaluation partitions of the ASVspoof 2019 PA task, and that straightforward fusion can push performance still lower. Such empirical benchmarks are useful for the community even when the underlying architecture is standard.

major comments (2)

[Abstract] The central performance claims (EER = 1.04% on the development set and 1.08% on the evaluation set for the best single system) are stated without any accompanying description of the training procedure, optimizer, learning-rate schedule, early-stopping criterion, or number of random seeds. Because the evaluation EER is obtained after tuning on the development set, these omissions make it impossible to judge whether the quoted figures are reproducible or the result of post-hoc selection.
[Abstract] No ablation or comparative table is referenced that isolates the contribution of the group-delay gram versus other phase or magnitude features, nor the incremental gain from speed perturbation. Without such controlled comparisons the claim that the residual network “trained by the speed-perturbed group delay gram” is the decisive factor remains unsupported by the reported evidence.

minor comments (1)

[Abstract] The abstract states that “several single systems” are fused by simple averaging but does not indicate how many systems, which feature sets, or whether the fusion weights were tuned on the development set.
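For reference, the "simple average" fusion the abstract describes amounts to an unweighted mean of per-utterance scores, which has no weights to tune. The sketch below assumes each system's output is a dictionary from utterance ID to score. That data layout and the name `average_scores` are illustrative assumptions, and the abstract does not say whether scores were normalized before averaging.

```python
def average_scores(system_scores):
    """Fuse per-utterance scores from several systems by an unweighted mean."""
    utterance_ids = set(system_scores[0])
    for scores in system_scores[1:]:
        # Every system must have scored exactly the same trials.
        if set(scores) != utterance_ids:
            raise ValueError("all systems must score the same utterances")
    n = len(system_scores)
    return {utt: sum(s[utt] for s in system_scores) / n
            for utt in sorted(utterance_ids)}
```

If the single systems produce scores on very different scales, an unweighted mean lets the widest-ranging system dominate, which is one reason the referee's question about the fused systems matters.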

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our ASVspoof 2019 PA system description. We address the two major comments point-by-point below and will revise the manuscript to improve clarity and support for the reported claims.

Referee: [Abstract] The central performance claims (EER = 1.04% on the development set and 1.08% on the evaluation set for the best single system) are stated without any accompanying description of the training procedure, optimizer, learning-rate schedule, early-stopping criterion, or number of random seeds. Because the evaluation EER is obtained after tuning on the development set, these omissions make it impossible to judge whether the quoted figures are reproducible or the result of post-hoc selection.

Authors: We agree that the abstract lacks sufficient experimental-setup details to support reproducibility claims. In the revised version we will add a concise sentence summarizing the optimizer (Adam), initial learning rate, schedule, and early-stopping rule used for the ResNet, while keeping the abstract within length limits. Full hyper-parameter values and the number of random seeds are already stated in Section 3.2; the revision will simply make this information visible from the abstract.

Revision: yes

Referee: [Abstract] No ablation or comparative table is referenced that isolates the contribution of the group-delay gram versus other phase or magnitude features, nor the incremental gain from speed perturbation. Without such controlled comparisons the claim that the residual network “trained by the speed-perturbed group delay gram” is the decisive factor remains unsupported by the reported evidence.

Authors: The manuscript already contains comparative EER tables for multiple magnitude- and phase-derived features (Table 2) and reports the effect of speed perturbation on the same architecture (Section 4.3). Nevertheless, we acknowledge that an explicit ablation isolating each factor is not referenced in the abstract. We will therefore insert a short ablation table (or reference an existing one more prominently) that shows incremental gains when adding speed perturbation and when switching to the group-delay gram; this will be added in the revised manuscript.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical results on external benchmark

The paper reports EER numbers obtained by training a ResNet on speed-perturbed group-delay features using the official ASVspoof 2019 PA training and development sets, then submitting scores for organizer-computed evaluation EER. No equations, derivations, or predictions are present that reduce by construction to fitted parameters or self-citations. The central claims are direct experimental outcomes on an externally defined challenge protocol with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review based solely on abstract; no mathematical derivations or new entities are present. The work rests on standard supervised learning assumptions and the validity of the ASVspoof 2019 dataset labels and splits.

axioms (1)

Domain assumption: The ASVspoof 2019 development and evaluation sets provide unbiased measures of replay detection performance.
Invoked implicitly by reporting EER on those sets as the primary result.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

"Our best single system employs a residual neural network trained by the speed-perturbed group delay gram. It achieves EER of 1.04% on the development set..."

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

"The network structure here is somewhat similar to that one in [19]. ... we train a single DNN from scratch only using the ASVspoof 2019 training set."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 1 internal anchor

[1]

in the wild

Introduction. Automatic speaker verification (ASV) refers to automatically accept or reject a claimed identity by analyzing speech utterances, and nowadays it is widely used in real-world biometric authentication applications [1–3]. Recently, a growing number of studies have confirmed the severe vulnerability of state-of-the-art ASV systems under a divers...

work page 2015
[2]

The DKU Replay Detection System for the ASVspoof 2019 Challenge: On Data Augmentation, Feature Representation, Classification, and Fusion

Methods 2.1. Utterance-level DNN framework. All of our systems are built upon a unified utterance-level DNN framework. We first use the approach for the task of speaker and [Figure 1 labels: variable-length feature sequence, convolutional layers, GAP, utterance-level representation, bona fide, spoof] Figure 1: Utterance-level DNN framework fo...

work page internal anchor Pith review Pith/arXiv arXiv 1907
[3]

Data protocol and evaluation metrics. We strictly respect the official protocols defined in the evaluation plan and submit both development and evaluation set scores

Experiments 3.1. Data protocol and evaluation metrics. We strictly respect the official protocols defined in the evaluation plan and submit both development and evaluation set scores. We have 54 000 training utterances altogether, including 5400 bona fide audio and 48 600 spoofed audio from various replay conditions. The primary metric is the minimum normal...

work page 1953
[4]

First, we utilize a ResNet back-end classifier based on the introduced utterance-level deep learning framework

Conclusion. In this paper, we take the DKU system for ASVspoof 2019 challenge as the pivot and explore the design of the countermeasure system against replay attack through data augmentation, feature representation, classification, and fusion. First, we utilize a ResNet back-end classifier based on the introduced utterance-level deep learning framework....

work page 2019
[5]

Robust text-independent speaker identification using gaussian mixture speaker models,

D. A. Reynolds and R. C. Rose, “Robust text-independent speaker identification using gaussian mixture speaker models,” IEEE Transactions on Speech & Audio Processing, vol. 3, no. 1, pp. 72–83, 1995

work page 1995
[6]

An overview of text-independent speaker recognition: From features to supervectors,

T. Kinnunen and H. Li, “An overview of text-independent speaker recognition: From features to supervectors,” Speech Communication, vol. 52, no. 1, pp. 12–40, 2010

work page 2010
[7]

Speaker recognition by machines and humans: A tutorial review,

J. H. L. Hansen and T. Hasan, “Speaker recognition by machines and humans: A tutorial review,” IEEE Signal Processing Magazine, vol. 32, no. 6, pp. 74–99, 2015

work page 2015
[8]

Spoofing and countermeasures for automatic speaker verification,

N. Evans, T. Kinnunen, and J. Yamagishi, “Spoofing and countermeasures for automatic speaker verification,” in Proc. INTERSPEECH 2013, 2013

work page 2013
[9]

Spoofing and countermeasures for speaker verification: A survey,

Z. Wu, N. Evans, T. Kinnunen, J. Yamagishi, F. Alegre, and H. Li, “Spoofing and countermeasures for speaker verification: A survey,” Speech Communication, vol. 66, pp. 130–153, 2015

work page 2015
[10]

On the vulnerability of speaker verification to realistic voice spoofing,

S. K. Ergunay, E. Khoury, A. Lazaridis, and S. Marcel, “On the vulnerability of speaker verification to realistic voice spoofing,” in IEEE International Conference on Biometrics Theory, Applications and Systems, 2015, pp. 1–6

work page 2015
[11]

Asvspoof: the automatic speaker verification spoofing and countermeasures challenge,

Z. Wu, J. Yamagishi, T. Kinnunen, C. Hanilçi, M. Sahidullah, A. Sizov, N. Evans, M. Todisco, and H. Delgado, “Asvspoof: the automatic speaker verification spoofing and countermeasures challenge,” IEEE Journal of Selected Topics in Signal Processing, vol. PP, no. 99, pp. 1–1, 2017

work page 2017
[12]

Asvspoof 2015: the first automatic speaker verification spoofing and countermeasures challenge,

Z. Wu, T. Kinnunen, N. Evans, J. Yamagishi, C. Hanilçi, M. Sahidullah, and A. Sizov, “Asvspoof 2015: the first automatic speaker verification spoofing and countermeasures challenge,” in Proc. INTERSPEECH 2015, 2015

work page 2015
[13]

Reddots replayed: A new replay spoofing attack corpus for text-dependent speaker verification research,

T. Kinnunen, M. Sahidullah, M. Falcone, L. Costantini, R. G. Hautamäki, D. Thomsen, A. Sarkar, Z. Tan, H. Delgado, and M. Todisco, “Reddots replayed: A new replay spoofing attack corpus for text-dependent speaker verification research,” in Proc. ICASSP 2017, 2017

work page 2017
[14]

The asvspoof 2017 challenge: Assessing the limits of replay spoofing attack detection,

T. Kinnunen, M. Sahidullah, H. Delgado, M. Todisco, N. Evans, J. Yamagishi, and K. A. Lee, “The asvspoof 2017 challenge: Assessing the limits of replay spoofing attack detection,” in Proc. Interspeech 2017, 2017, pp. 2–6

work page 2017
[15]

Asvspoof 2017 version 2.0: meta-data analysis and baseline enhancements,

H. Delgado, M. Todisco, M. Sahidullah, N. Evans, T. Kinnunen, K. A. Lee, and J. Yamagishi, “Asvspoof 2017 version 2.0: meta-data analysis and baseline enhancements,” in Proc. Speaker Odyssey 2018, 2018, pp. 296–303

work page 2017
[16]

[Online]

Asvspoof 2019 evaluation plan. [Online]. Available: http://www.asvspoof.org/asvspoof2019/asvspoof2019_evaluation_plan.pdf

work page 2019
[17]

Classifiers for synthetic speech detection: A comparison,

C. Hanilçi, T. Kinnunen, M. Sahidullah, and A. Sizov, “Classifiers for synthetic speech detection: A comparison,” in Proc. INTERSPEECH 2015, 2015

work page 2015
[18]

A new feature for automatic speaker verification anti-spoofing: Constant q cepstral coefficients,

M. Todisco, H. Delgado, and N. Evans, “A new feature for automatic speaker verification anti-spoofing: Constant q cepstral coefficients,” in Proc. Speaker Odyssey 2016, 2016

work page 2016
[19]

Calculation of a constant q spectral transform,

J. C. Brown, “Calculation of a constant q spectral transform,” Journal of the Acoustical Society of America, vol. 89, no. 1, pp. 425–434, 1991

work page 1991
[20]

Constant q cepstral coefficients: A spoofing countermeasure for automatic speaker verification,

M. Todisco, H. Delgado, and N. Evans, “Constant q cepstral coefficients: A spoofing countermeasure for automatic speaker verification,” Computer Speech & Language, vol. 45, pp. 516–535, 2017

work page 2017
[21]

An investigation of deep-learning frameworks for speaker verification antispoofing,

C. Zhang, C. Yu, and J. H. L. Hansen, “An investigation of deep-learning frameworks for speaker verification antispoofing,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 4, pp. 684–694, June 2017

work page 2017
[22]

Audio replay attack detection with deep learning frameworks,

G. Lavrentyeva, S. Novoselov, E. Malykh, A. Kozlov, O. Kudashev, and V. Shchemelinin, “Audio replay attack detection with deep learning frameworks,” in Proc. INTERSPEECH 2017, 2017, pp. 82–86

work page 2017
[23]

End-To-End Audio Replay Attack Detection Using Deep Convolutional Networks with Attention,

F. Tom, M. Jain, and P. Dey, “End-To-End Audio Replay Attack Detection Using Deep Convolutional Networks with Attention,” in Proc. INTERSPEECH 2018, 2018, pp. 681–685

work page 2018
[24]

Countermeasures for automatic speaker verification replay spoofing attack: On data augmentation, feature representation, classification and fusion,

W. Cai, D. Cai, W. Liu, G. Li, and M. Li, “Countermeasures for automatic speaker verification replay spoofing attack: On data augmentation, feature representation, classification and fusion,” in Proc. INTERSPEECH 2017, 2017, pp. 17–21

work page 2017
[25]

Exploring the encoding layer and loss function in end-to-end speaker and language recognition system,

W. Cai, J. Chen, and M. Li, “Exploring the encoding layer and loss function in end-to-end speaker and language recognition system,” in Proc. Speaker Odyssey, 2018, pp. 74–81

work page 2018
[26]

Insights into end-to-end learning scheme for language identification,

W. Cai, Z. Cai, W. Liu, X. Wang, and M. Li, “Insights into end-to-end learning scheme for language identification,” in Proc. ICASSP 2018, 2018, pp. 5209–5213

work page 2018
[27]

Analysis of length normalization in end-to-end speaker verification system,

W. Cai, J. Chen, and M. Li, “Analysis of length normalization in end-to-end speaker verification system,” in Proc. INTERSPEECH 2018, 2018, pp. 3618–3622

work page 2018
[28]

Imagenet: A large-scale hierarchical image database,

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Proc. CVPR 2009, 2009, pp. 248–255

work page 2009
[29]

Network in network,

M. Lin, Q. Chen, and S. Yan, “Network in network,” in Proc. ICLR 2014, 2014

work page 2014
[30]

A comparison of features for synthetic speech detection,

M. Sahidullah, T. Kinnunen, and C. Hanilçi, “A comparison of features for synthetic speech detection,” in Proc. INTERSPEECH 2015, 2015, pp. 2087–2091

work page 2015
[31]

The modified group delay function and its application to phoneme recognition,

H. A. Murthy and V. Gadde, “The modified group delay function and its application to phoneme recognition,” in Proc. ICASSP 2003, vol. 1, 2003, pp. I–68

work page 2003
[32]

Significance of the Modified Group Delay Feature in Speech Recognition,

R. M. Hegde, H. A. Murthy, and V. R. R. Gadde, “Significance of the Modified Group Delay Feature in Speech Recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 1, pp. 190–202

work page
[33]

Synthetic speech detection using phase information,

I. Saratxaga, J. Sanchez, Z. Wu, I. Hernaez, and E. Navas, “Synthetic speech detection using phase information,” Speech Communication, vol. 81, pp. 30–41, 2016

work page 2016
[34]

On the usefulness of stft phase spectrum in human listening tests,

K. K. Paliwal and L. D. Alsteris, “On the usefulness of stft phase spectrum in human listening tests,” Speech Communication, vol. 45, no. 2, pp. 153–170, 2005

work page 2005
[35]

Audio augmentation for speech recognition,

T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, “Audio augmentation for speech recognition,” in Proc. INTERSPEECH 2015, 2015, pp. 3586–3589

work page 2015
[36]

t-dcf: a detection cost function for the tandem assessment of spoofing countermeasures and automatic speaker verification,

T. Kinnunen, K. A. Lee, H. Delgado, N. Evans, M. Todisco, M. Sahidullah, J. Yamagishi, and D. A. Reynolds, “t-dcf: a detection cost function for the tandem assessment of spoofing countermeasures and automatic speaker verification,” in Proc. Speaker Odyssey 2018, 2018, pp. 312–319

work page 2018
[37]

Deep residual learning for image recognition,

K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. CVPR 2016, 2016, pp. 770–778

work page 2016

[1] [1]

in the wild

Introduction Automatic speaker veriﬁcation (ASV) refers to automatically accept or reject a claimed identity by analyzing speech utter- ances, and nowadays it is widely used in real-world biometric authentication applications [1–3]. Recently, a growing number of studies have conﬁrmed the severe vulnerability of state-of- the-art ASV systems under a divers...

work page 2015

[2] [2]

The DKU Replay Detection System for the ASVspoof 2019 Challenge: On Data Augmentation, Feature Representation, Classification, and Fusion

Methods 2.1. Utterance-level DNN framework All of our systems are built upon a uniﬁed utterance-level DNN framework. We ﬁrst use the approach for the task of speaker and arXiv:1907.02663v1 [eess.AS] 5 Jul 2019 GAP Utterance-level representation Bona fide Variable-length feature sequence Convolutional layers Spoof Figure 1: Utterance-level DNN framework fo...

work page internal anchor Pith review Pith/arXiv arXiv 1907

[3] [3]

Data protocol and evaluation metrics We strictly respect the ofﬁcial protocols deﬁned in the eval- uation plan and submit both development and evaluation set scores

Experiments 3.1. Data protocol and evaluation metrics We strictly respect the ofﬁcial protocols deﬁned in the eval- uation plan and submit both development and evaluation set scores. We have54 000 training utterances altogether, including 5400 bona ﬁde audio and48 600 spoofed audio from various re- play conditions. The primary metric is the minimum normal...

work page 1953

[4] [4]

First, we utilize a ResNet back-end classiﬁer based on the introduced utterance- level deep learning framework

Conclusion In this paper, we take the DKU system for ASVspoof 2019 chal- lenge as the pivot and explore the design of the countermeasure system against replay attack through data augmentation, fea- ture representation, classiﬁcation, and fusion. First, we utilize a ResNet back-end classiﬁer based on the introduced utterance- level deep learning framework....

work page 2019

[5] [5]

Robust text-independent speaker identiﬁcation using gaussian mixture speaker models,

D. A. Reynolds and R. C. Rose, “Robust text-independent speaker identiﬁcation using gaussian mixture speaker models,” IEEE Transactions on Speech & Audio Processing , vol. 3, no. 1, pp. 72–83, 1995

work page 1995

[6] [6]

An overview of text-independent speaker recognition: From features to supervectors,

T. Kinnunen and H. Li, “An overview of text-independent speaker recognition: From features to supervectors,” Speech Communica- tion, vol. 52, no. 1, pp. 12–40, 2010

work page 2010

[7] [7]

Speaker recognition by machines and humans: A tutorial review,

J. H. L. Hansen and T. Hasan, “Speaker recognition by machines and humans: A tutorial review,” IEEE Signal Processing Maga- zine, vol. 32, no. 6, pp. 74–99, 2015

work page 2015

[8] [8]

Spooﬁng and coun- termeasures for automatic speaker veriﬁcation,

N. Evans, T. Kinnunen, and J. Yamagishi, “Spooﬁng and coun- termeasures for automatic speaker veriﬁcation,” in Proc. INTER- SPEECH 2013, 2013

work page 2013

[9] [9]

Spooﬁng and countermeasures for speaker veriﬁcation: A sur- vey,

Z. Wu, N. Evans, T. Kinnunen, J. Yamagishi, F. Alegre, and H. Li, “Spooﬁng and countermeasures for speaker veriﬁcation: A sur- vey,”Speech Communication, vol. 66, pp. 130–153, 2015

work page 2015

[10] [10]

On the vulnerability of speaker veriﬁcation to realistic voice spooﬁng,

S. K. Ergunay, E. Khoury, A. Lazaridis, and S. Marcel, “On the vulnerability of speaker veriﬁcation to realistic voice spooﬁng,” in IEEE International Conference on Biometrics Theory, Applica- tions and Systems, 2015, pp. 1–6

work page 2015

[11] [11]

Asvspoof: the automatic speaker veriﬁcation spooﬁng and countermeasures challenge,

Z. Wu, J. Yamagishi, T. Kinnunen, C. Hanilci, M. Sahidullah, A. Sizov, N. Evans, M. Todisco, and H. Delgado, “Asvspoof: the automatic speaker veriﬁcation spooﬁng and countermeasures challenge,” IEEE Journal of Selected Topics in Signal Processing, vol. PP, no. 99, pp. 1–1, 2017

work page 2017

[12] [12]

Asvspoof 2015: the ﬁrst automatic speaker veriﬁ- cation spooﬁng and countermeasures challenge,

Z. Wu, T. Kinnunen, N. Evans, J. Yamagishi, C. Hanilc, S. M., and S. Aleksandr, “Asvspoof 2015: the ﬁrst automatic speaker veriﬁ- cation spooﬁng and countermeasures challenge,” inProc. INTER- SPEECH 2015, 2015

work page 2015

[13] [13]

Reddots replayed: A new replay spooﬁng attack corpus for text-dependent speaker veriﬁcation research,

T. Kinnunen, M. Sahidullah, M. Falcone, L. Costantini, R. G. Hautamki, D. Thomsen, A. Sarkar, Z. Tan, H. Delgado, and M. Todisco, “Reddots replayed: A new replay spooﬁng attack corpus for text-dependent speaker veriﬁcation research,” in Proc. ICCASP 2017, 2017

work page 2017

[14] [14]

The asvspoof 2017 challenge: As- sessing the limits of replay spooﬁng attack detection,

T. Kinnunen, M. Sahidullah, H. Delgado, M. Todisco, N. Evans, J. Yamagishi, and K. A. Lee, “The asvspoof 2017 challenge: As- sessing the limits of replay spooﬁng attack detection,” in Proc. Interspeech 2017, 2017, pp. 2–6

work page 2017

[15] [15]

Asvspoof 2017 version 2.0: meta-data analysis and baseline enhancements,

H. Delgado, M. Todisco, M. Sahidullah, N. Evans, T. Kin- nunen, K. A. Lee, and J. Yamagishi, “Asvspoof 2017 version 2.0: meta-data analysis and baseline enhancements,” in Proc. Speaker Odyssey 2018, 2018, pp. 296–303

work page 2017

[16] ASVspoof 2019 evaluation plan. [Online]. Available: http://www.asvspoof.org/asvspoof2019/asvspoof2019_evaluation_plan.pdf

[17] C. Hanilçi, T. Kinnunen, M. Sahidullah, and A. Sizov, “Classifiers for synthetic speech detection: A comparison,” in Proc. INTERSPEECH 2015, 2015.

[18] M. Todisco, H. Delgado, and N. Evans, “A new feature for automatic speaker verification anti-spoofing: Constant Q cepstral coefficients,” in Proc. Speaker Odyssey 2016, 2016.

[19] J. C. Brown, “Calculation of a constant Q spectral transform,” Journal of the Acoustical Society of America, vol. 89, no. 1, pp. 425–434, 1991.

[20] M. Todisco, H. Delgado, and N. Evans, “Constant Q cepstral coefficients: A spoofing countermeasure for automatic speaker verification,” Computer Speech & Language, vol. 45, pp. 516–535, 2017.

[21] C. Zhang, C. Yu, and J. H. L. Hansen, “An investigation of deep-learning frameworks for speaker verification antispoofing,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 4, pp. 684–694, June 2017.

[22] G. Lavrentyeva, S. Novoselov, E. Malykh, A. Kozlov, O. Kudashev, and V. Shchemelinin, “Audio replay attack detection with deep learning frameworks,” in Proc. INTERSPEECH 2017, 2017, pp. 82–86.

[23] F. Tom, M. Jain, and P. Dey, “End-To-End Audio Replay Attack Detection Using Deep Convolutional Networks with Attention,” in Proc. INTERSPEECH 2018, 2018, pp. 681–685.

[24] W. Cai, D. Cai, W. Liu, G. Li, and M. Li, “Countermeasures for automatic speaker verification replay spoofing attack: On data augmentation, feature representation, classification and fusion,” in Proc. INTERSPEECH 2017, 2017, pp. 17–21.

[25] W. Cai, J. Chen, and M. Li, “Exploring the encoding layer and loss function in end-to-end speaker and language recognition system,” in Proc. Speaker Odyssey, 2018, pp. 74–81.

[26] W. Cai, Z. Cai, W. Liu, X. Wang, and M. Li, “Insights into end-to-end learning scheme for language identification,” in Proc. ICASSP 2018, 2018, pp. 5209–5213.

[27] W. Cai, J. Chen, and M. Li, “Analysis of length normalization in end-to-end speaker verification system,” in Proc. INTERSPEECH 2018, 2018, pp. 3618–3622.

[28] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Proc. CVPR 2009, 2009, pp. 248–255.

[29] M. Lin, Q. Chen, and S. Yan, “Network in network,” in Proc. ICLR 2014, 2014.

[30] M. Sahidullah, T. Kinnunen, and C. Hanilçi, “A comparison of features for synthetic speech detection,” in Proc. INTERSPEECH 2015, 2015, pp. 2087–2091.

[31] H. A. Murthy and V. Gadde, “The modified group delay function and its application to phoneme recognition,” in Proc. ICASSP 2003, vol. 1, 2003, pp. I–68.

[32] R. M. Hegde, H. A. Murthy, and V. R. R. Gadde, “Significance of the Modified Group Delay Feature in Speech Recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 1, pp. 190–202.

[33] I. Saratxaga, J. Sanchez, Z. Wu, I. Hernaez, and E. Navas, “Synthetic speech detection using phase information,” Speech Communication, vol. 81, pp. 30–41, 2016.

[34] K. K. Paliwal and L. D. Alsteris, “On the usefulness of STFT phase spectrum in human listening tests,” Speech Communication, vol. 45, no. 2, pp. 153–170, 2005.

[35] T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, “Audio augmentation for speech recognition,” in Proc. INTERSPEECH 2015, 2015, pp. 3586–3589.

[36] T. Kinnunen, K. A. Lee, H. Delgado, N. Evans, M. Todisco, M. Sahidullah, J. Yamagishi, and D. A. Reynolds, “t-DCF: a detection cost function for the tandem assessment of spoofing countermeasures and automatic speaker verification,” in Proc. Speaker Odyssey 2018, 2018, pp. 312–319.

[37] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. CVPR 2016, 2016, pp. 770–778.