AASIST3: KAN-Enhanced AASIST Speech Deepfake Detection using SSL Features and Additional Regularization for the ASVspoof 2024 Challenge

Alexey Efimenko; Dmitrii Korzh; Grach Mkrtchian; Kirill Borodin; Mikhail Gorodnichev; Oleg Y. Rogov; Vasiliy Kudryavtsev

arXiv: 2408.17352 · v2 · pith:LRIV5ED3new · submitted 2024-08-30 · 💻 cs.SD · cs.AI · eess.AS

Kirill Borodin, Vasiliy Kudryavtsev, Dmitrii Korzh, Alexey Efimenko, Grach Mkrtchian, Mikhail Gorodnichev, Oleg Y. Rogov

Pith reviewed 2026-05-23 21:30 UTC · model grok-4.3

classification 💻 cs.SD · cs.AI · eess.AS

keywords speech deepfake detection · AASIST · Kolmogorov-Arnold networks · ASVspoof 2024 · self-supervised learning · voice conversion · text-to-speech · speaker verification


The pith

Enhancing AASIST with Kolmogorov-Arnold networks and additional components more than doubles performance on speech deepfake detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AASIST3 as an updated architecture for identifying synthetic speech that threatens automatic speaker verification systems used in authentication and fraud detection. It builds on the original AASIST by adding Kolmogorov-Arnold networks, extra layers, encoders, pre-emphasis filters, and self-supervised learning features to improve separation of real and generated audio. A sympathetic reader would see value in stronger defenses against text-to-speech and voice-conversion attacks that could compromise financial transactions or device access. The reported results show minDCF dropping to 0.5357 in closed conditions and 0.1414 in open conditions.

Core claim

By integrating Kolmogorov-Arnold networks into the AASIST framework along with additional layers, encoders, pre-emphasis techniques, and SSL features, AASIST3 achieves minDCF scores of 0.5357 in the closed condition and 0.1414 in the open condition on the ASVspoof 2024 Challenge, representing more than a twofold improvement in deepfake speech detection.

What carries the argument

The AASIST3 architecture that augments AASIST with Kolmogorov-Arnold networks for feature transformation plus regularization layers and pre-emphasis.
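As a rough illustration of the Kolmogorov-Arnold idea that the architecture builds on, the sketch below computes one layer in which every output is a sum of univariate functions of the inputs, out[q] = sum over p of phi[q][p](x[p]). The function name `kan_layer` and the fixed edge functions are hypothetical and chosen only for this example; published KAN work learns the edge functions (for instance as splines), and nothing here reproduces the KAN modules of AASIST3.

```python
import math
from typing import Callable, List

Univariate = Callable[[float], float]


def kan_layer(x: List[float], phi: List[List[Univariate]]) -> List[float]:
    """One Kolmogorov-Arnold style layer: out[q] = sum_p phi[q][p](x[p])."""
    for row in phi:
        if len(row) != len(x):
            raise ValueError("each output row needs one function per input")
    return [sum(f(xp) for f, xp in zip(row, x)) for row in phi]


# Hypothetical, fixed edge functions (assumption: a trained KAN would learn these).
phi = [
    [math.sin, lambda v: v * v],  # output 0: sin(x0) + x1^2
    [math.tanh, abs],             # output 1: tanh(x0) + |x1|
]
out = kan_layer([0.0, 2.0], phi)
print(out)
```

The point of the construction is that all nonlinearity lives on the edges as one-dimensional functions, while the nodes only add.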

If this is right

Detection of synthetic voices improves by more than a factor of two in both closed and open conditions.
Automatic speaker verification systems gain stronger resistance to TTS and VC attacks.
The updated model becomes publicly available for deployment in authentication and forensic tasks.
The combination of SSL features with the new components supports better handling of unseen synthesis methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar Kolmogorov-Arnold network additions could be tested on other audio classification tasks such as speaker identification.
The approach may lower the amount of labeled data needed for training robust deepfake detectors if SSL features prove complementary.
Reproducing the gains on later ASVspoof editions or independent datasets would indicate whether the changes generalize beyond the 2024 challenge.

Load-bearing premise

The performance gains come from the listed architectural changes rather than differences in training data, optimization, or evaluation protocol.

What would settle it

A side-by-side run of the original AASIST and AASIST3 on identical training data and the same ASVspoof 2024 evaluation protocol that fails to show at least a twofold reduction in minDCF.
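The twofold criterion can be checked with simple arithmetic. The sketch below uses the two values that appear in the extracted experiment snippet further down this page (Classic AASIST 0.5671, AASIST3 0.2657). Treating them as a matched baseline and candidate is an assumption made here for illustration, since the page does not state which metric, data split, or training setup they refer to; `reduction_factor` is a hypothetical helper name.

```python
def reduction_factor(baseline: float, candidate: float) -> float:
    """Return how many times smaller the candidate cost is than the baseline."""
    if baseline <= 0 or candidate <= 0:
        raise ValueError("costs must be positive")
    return baseline / candidate


# Values copied from the extracted snippet on this page (assumed comparable).
classic_aasist = 0.5671
aasist3 = 0.2657

factor = reduction_factor(classic_aasist, aasist3)
is_twofold = factor >= 2.0
print(f"reduction factor: {factor:.3f}, at least twofold: {is_twofold}")
```

Under that assumption the ratio is a little above 2.1, which is consistent with the "more than twofold" wording but does not by itself show that the two runs were matched.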

Figures

Figures reproduced from arXiv: 2408.17352 by Alexey Efimenko, Dmitrii Korzh, Grach Mkrtchian, Kirill Borodin, Mikhail Gorodnichev, Oleg Y. Rogov, Vasiliy Kudryavtsev.

**Figure 2.** The KAN-HS-GAL Operation.

3.5. Experiments with different KAN-based encoders. As one of the key ideas of our model is the use of KAN, we chose to explore the potential of KAN-based encoders, including KAN+tokenization [42], ReluConvKAN and WavKANConv [43], both in stand-alone form and combined with a sharpness-aware minimisation [20] mechanism. When evaluated using a validation set with SAM, WavKanConv showed…

Original abstract

Automatic Speaker Verification (ASV) systems, which identify speakers based on their voice characteristics, have numerous applications, such as user authentication in financial transactions, exclusive access control in smart devices, and forensic fraud detection. However, the advancement of deep learning algorithms has enabled the generation of synthetic audio through Text-to-Speech (TTS) and Voice Conversion (VC) systems, exposing ASV systems to potential vulnerabilities. To counteract this, we propose a novel architecture named AASIST3. By enhancing the existing AASIST framework with Kolmogorov-Arnold networks, additional layers, encoders, and pre-emphasis techniques, AASIST3 achieves a more than twofold improvement in performance. It demonstrates minDCF results of 0.5357 in the closed condition and 0.1414 in the open condition, significantly enhancing the detection of synthetic voices and improving ASV security. The new version of the model is publicly available at HuggingFace (2026): https://huggingface.co/lab260/Spectra-AASIST3
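The abstract lists pre-emphasis among the additions, but this page does not give the filter or its coefficient. As an illustration only, a first-order pre-emphasis filter is commonly written y[n] = x[n] - alpha * x[n-1], which boosts high frequencies relative to low ones. The sketch below assumes that form; the default alpha = 0.97 is a conventional choice picked for this example, not a value taken from the paper.

```python
from typing import List


def pre_emphasis(signal: List[float], alpha: float = 0.97) -> List[float]:
    """First-order pre-emphasis: y[0] = x[0], y[n] = x[n] - alpha * x[n-1]."""
    if not signal:
        return []
    out = [signal[0]]  # the first sample has no predecessor and is kept as-is
    for n in range(1, len(signal)):
        out.append(signal[n] - alpha * signal[n - 1])
    return out


# A constant (zero-frequency) signal is strongly attenuated after the first sample.
filtered = pre_emphasis([1.0, 1.0, 1.0], alpha=0.5)
print(filtered)
```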

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

AASIST3 reports better minDCF numbers on ASVspoof 2024 after adding KAN layers and a few other tweaks to the original AASIST, but supplies no ablations or matched baselines to show those changes caused the gains.

The core of this paper is a direct extension of the AASIST architecture for audio deepfake detection. They insert Kolmogorov-Arnold network layers, add some extra encoders and layers, apply pre-emphasis, and release the resulting model. On the 2024 ASVspoof challenge it reaches minDCF of 0.5357 in the closed condition and 0.1414 in the open condition, which the abstract calls more than a twofold improvement. The model is also put on Hugging Face, which is useful for anyone who wants to try it directly. That is the new empirical result and the main practical contribution. The work stays within the established deepfake detection line rather than introducing a new framework or theoretical angle. The numbers are presented clearly enough for a challenge paper.

The main weakness is the missing controls. The abstract states the performance claim but does not report the original AASIST result under identical training conditions, nor any ablation that isolates the KAN component, the extra layers, or the pre-emphasis. Without those, it is impossible to attribute the delta to the listed modifications instead of differences in data handling, optimization, or evaluation protocol. The paper does not appear to contain equations or formal derivations, so the circularity burden is low, but the empirical attribution remains the load-bearing part.

This is mainly of interest to groups already running ASVspoof-style benchmarks or building production anti-spoofing systems. A reader who needs the latest numbers on that specific dataset and is willing to test the released model themselves can get value from it. It is coherent on its own terms and shows straightforward engagement with the task, so it clears the bar for serious refereeing even though the attribution gap would need to be addressed in revision.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes AASIST3, an enhancement of the AASIST architecture for speech deepfake detection on the ASVspoof 2024 Challenge. It augments the base model with Kolmogorov-Arnold Networks (KAN), additional layers, encoders, pre-emphasis, SSL features, and extra regularization, reporting minDCF scores of 0.5357 (closed condition) and 0.1414 (open condition) and claiming more than twofold improvement. The updated model is released publicly on Hugging Face.

Significance. If the reported gains can be shown to arise from the listed architectural changes rather than uncontrolled differences in training or evaluation, the result would strengthen practical defenses against synthetic speech in ASV systems. The public model release would also aid reproducibility.

major comments (2)

[Abstract] Abstract: The central claim of 'more than twofold improvement' is load-bearing yet unsupported; the abstract supplies neither the original AASIST minDCF under matched conditions nor any ablation isolating KAN, extra layers, encoders, or pre-emphasis from possible differences in data, optimizer, or protocol.
[Abstract] Abstract: No methodology details, baselines, error bars, or statistical significance tests accompany the reported minDCF values (0.5357 closed / 0.1414 open), preventing assessment of whether the numbers reliably support the improvement claim.

minor comments (1)

[Abstract] The Hugging Face link contains '(2026)', which appears to be a typographical error given the 2024 arXiv date.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the comments. We address each major comment below and will revise the abstract accordingly.


Referee: [Abstract] Abstract: The central claim of 'more than twofold improvement' is load-bearing yet unsupported; the abstract supplies neither the original AASIST minDCF under matched conditions nor any ablation isolating KAN, extra layers, encoders, or pre-emphasis from possible differences in data, optimizer, or protocol.

Authors: We agree the abstract should support the claim with the original AASIST minDCF under matched ASVspoof 2024 conditions. The revised abstract will report that baseline value and reference an ablation study (to be added in the main text) isolating the contributions of KAN, additional layers, encoders, and pre-emphasis. This will demonstrate that gains stem from the listed changes rather than training or protocol differences.
revision: yes

Referee: [Abstract] Abstract: No methodology details, baselines, error bars, or statistical significance tests accompany the reported minDCF values (0.5357 closed / 0.1414 open), preventing assessment of whether the numbers reliably support the improvement claim.

Authors: We acknowledge these elements are absent from the abstract. The revision will add concise methodology details and baselines to the abstract. Error bars and statistical tests will be included if multiple runs exist in our experiments; otherwise we will note the single-run results and retain the reported point estimates.
revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical architecture tweak with no derivations or self-referential predictions


The paper presents an empirical model enhancement (AASIST3) by adding KAN, layers, encoders, and pre-emphasis to AASIST, then reports minDCF numbers on ASVspoof 2024. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claim is a performance delta from architectural changes; while the skeptic correctly notes missing ablations, this is an attribution gap rather than circularity. The work is self-contained against external benchmarks and contains no self-definitional steps or reductions by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based on abstract only; no explicit free parameters, axioms, or invented entities are described. Typical deep learning hyperparameters are implicitly present but unspecified.

axioms (1)

Domain assumption: The ASVspoof 2024 evaluation protocol and metrics (minDCF) provide a fair measure of real-world deepfake detection capability.
Invoked by reporting challenge scores as the primary evidence of improvement.
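Since the whole argument rests on minDCF, a small sketch of how a minimum normalized detection cost can be computed from detector scores may help. It assumes higher scores mean "bona fide", sweeps every observed score as a threshold, and normalizes by the cost of the better trivial system. The function name `min_dcf` and the default prior and costs (`p_spoof=0.05`, `c_miss=1`, `c_fa=10`) are assumptions for illustration; this page does not state the challenge's official settings, which should be taken from the evaluation plan.

```python
from typing import Sequence


def min_dcf(
    bonafide_scores: Sequence[float],
    spoof_scores: Sequence[float],
    p_spoof: float = 0.05,  # assumed prior probability of a spoofed trial
    c_miss: float = 1.0,    # assumed cost of rejecting bona fide speech
    c_fa: float = 10.0,     # assumed cost of accepting spoofed speech
) -> float:
    """Minimum normalized detection cost over all score thresholds."""
    if not bonafide_scores or not spoof_scores:
        raise ValueError("both score lists must be non-empty")
    # Every observed score is a candidate threshold; +inf covers 'reject all'.
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores)) + [float("inf")]
    # Cost of the better trivial detector (accept everything or reject everything).
    norm = min(c_miss * (1.0 - p_spoof), c_fa * p_spoof)
    best = float("inf")
    for t in thresholds:
        # A trial is accepted as bona fide when its score is >= t.
        p_miss = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        p_fa = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        dcf = c_miss * (1.0 - p_spoof) * p_miss + c_fa * p_spoof * p_fa
        best = min(best, dcf / norm)
    return best


separable = min_dcf([2.0, 3.0], [0.0, 1.0])  # a threshold of 2.0 makes no errors
uninformative = min_dcf([1.0], [1.0])        # scores carry no information
print(separable, uninformative)
```

The threshold sweep is quadratic in the number of trials, which is fine for a toy example but would be replaced by a sorted cumulative count on a full evaluation set.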

pith-pipeline@v0.9.0 · 5765 in / 1232 out tokens · 22990 ms · 2026-05-23T21:30:31.920919+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Interpreting Multi-Branch Anti-Spoofing Architectures: Correlating Internal Strategy with Empirical Performance
cs.SD 2026-02 unverdicted novelty 6.0

A framework using covariance-based spectral signatures and TreeSHAP attributions on AASIST3 branches identifies four operational archetypes and a flawed specialization mode that explains high error rates on specific s...

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1]

Introduction Automatic Speaker Verification (ASV) systems are designed to identify speakers based on their voice characteristics. These systems have a variety of applications, including in the financial sector for user authentication during transactions, in smart devices to ensure exclusive access for the owner to control their equipment, and in forensi...

work page
[2]

Preliminaries 2.1. Kolmogorov-Arnold Network The Kolmogorov-Arnold theorem [26, 27, 28] postulates that any continuous multivariate function $f : [0, 1]^n \to \mathbb{R}$ can be represented as a finite composition of continuous unary functions and a binary addition operation. More specifically: $f(x) = f(x_1, x_2, \ldots, x_n) = \sum_{q=0}^{2n} \Phi_q\left(\sum_{p=1}^{n} \phi_{q,p}(x_p)\right)$, (1) where $\phi_{q,p}$ ...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[3]

Description of final approaches For the Closed Condition, we considered our described AASIST3 model

Experiments and Results 3.1. Description of final approaches For the Closed Condition, we considered our described AASIST3 model. The model was constrained to accept only four seconds of audio as input, which proved insufficient for the models to achieve a deeper comprehension of the audio as a whole. To address this limitation, the audio was fed into t...

work page
[4]

This progress, however, has introduced a corresponding vulnerability in ASV systems, necessitating the development of a CM system to detect synthetic voices

Conclusion The rapid development of various deep learning algorithms has created new opportunities for generating synthetic audio using TTS and VC systems. This progress, however, has introduced a corresponding vulnerability in ASV systems, necessitating the development of a CM system to detect synthetic voices. In this paper, we proposed a novel ar...

work page
[5]

3.1 AASIST3 0.2657 sec

Classic AASIST 0.5671 sec. 3.1 AASIST3 0.2657 sec. 3.2 Leaf instead of SincConv 0.52 sec. 3.2 F0 subband + SincConv 0.4406 sec. 3.2 F0 subband instead of SincConv 0.4225 sec. 3.2 SincConv + CQT 0.4083 sec. 3.3 AttentionAugmentedConv2d in encoder 0.4762 sec. 3.3 Augmentations without pre-emphasis 0.4495 sec. 3.3 All augmentations 0.3624 sec. 3.4 AReLU inst...

work page
[6]

Spoofing and countermeasures for automatic speaker verification,

Nicholas Evans, Tomi Kinnunen, and Junichi Yamagishi, “Spoofing and countermeasures for automatic speaker verification,” in INTERSPEECH 2013, 14th Annual Conference of the International Speech Communication Association, August 25-29, 2013, Lyon, France, ISCA, Ed., Lyon, 2013

work page 2013
[7]

Asvspoof 2015: the first automatic speaker verification spoofing and countermeasures challenge,

Zhizheng Wu et al., “Asvspoof 2015: the first automatic speaker verification spoofing and countermeasures challenge,” in INTERSPEECH 2015, ISCA, Ed., Dresden, 2015

work page 2015
[8]

The ASVspoof 2017 Challenge: Assessing the Limits of Replay Spoofing Attack Detection,

Tomi Kinnunen, Md. Sahidullah, Héctor Delgado, et al., “The ASVspoof 2017 Challenge: Assessing the Limits of Replay Spoofing Attack Detection,” in Proc. Interspeech 2017, 2017, pp. 2–6

work page 2017
[9]

Asvspoof 2019: A large-scale public database of synthesized, converted and replayed speech,

Xin Wang, Junichi Yamagishi, Massimiliano Todisco, et al., “Asvspoof 2019: A large-scale public database of synthesized, converted and replayed speech,” 2020

work page 2019
[10]

Asvspoof 2021: Towards spoofed and deepfake speech detection in the wild,

Xuechen Liu et al., “Asvspoof 2021: Towards spoofed and deepfake speech detection in the wild,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2507–2522, 2023

work page 2021
[11]

Singfake: Singing voice deepfake detection,

Yongyi Zang, You Zhang, Mojtaba Heydari, and Zhiyao Duan, “Singfake: Singing voice deepfake detection,” 2024

work page 2024
[12]

STC Antispoofing Systems for the ASVspoof2019 Challenge,

Galina Lavrentyeva, Sergey Novoselov, Andzhukaev Tseren, Marina Volkova, Artem Gorlanov, and Alexandr Kozlov, “STC Antispoofing Systems for the ASVspoof2019 Challenge,” in Proc. Interspeech 2019, 2019, pp. 1033–1037

work page 2019
[13]

Overlapped Frequency-Distributed Network: Frequency-Aware Voice Spoofing Countermeasure,

Sunmook Choi, Il-Youp Kwak, and Seungsang Oh, “Overlapped Frequency-Distributed Network: Frequency-Aware Voice Spoofing Countermeasure,” in Proc. Interspeech 2022, 2022, pp. 3558–3562

work page 2022
[14]

Deep Residual Neural Networks for Audio Spoofing Detection,

Moustafa Alzantot, Ziqi Wang, and Mani B. Srivastava, “Deep Residual Neural Networks for Audio Spoofing Detection,” in Proc. Interspeech 2019, 2019, pp. 1078–1082

work page 2019
[15]

ASSERT: Anti-Spoofing with Squeeze-Excitation and Residual Networks,

Cheng-I Lai, Nanxin Chen, Jesús Villalba, and Najim Dehak, “ASSERT: Anti-Spoofing with Squeeze-Excitation and Residual Networks,” in Proc. Interspeech 2019, 2019, pp. 1013–1017

work page 2019
[16]

Speaker-Targeted Synthetic Speech Detection,

Diego Castan et al., “Speaker-Targeted Synthetic Speech Detection,” in Proc. The Speaker and Language Recognition Workshop (Odyssey 2022), 2022, pp. 62–69

work page 2022
[17]

Voice spoofing detection through residual network, max feature map, and depthwise separable convolution,

Il-Youp Kwak et al., “Voice spoofing detection through residual network, max feature map, and depthwise separable convolution,” IEEE Access, vol. PP, pp. 1–1, 01 2023

work page 2023
[18]

UR Channel-Robust Synthetic Speech Detection System for ASVspoof 2021,

Xinhui Chen, You Zhang, Ge Zhu, and Zhiyao Duan, “UR Channel-Robust Synthetic Speech Detection System for ASVspoof 2021,” in Proc. 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge, 2021, pp. 75–82

work page 2021
[19]

Attentional fusion tdnn for spoof speech detection,

Lei Wu and Ye Jiang, “Attentional fusion tdnn for spoof speech detection,” in 2022 5th International Conference on Pattern Recognition and Artificial Intelligence (PRAI), 2022, pp. 651–657

work page 2022
[20]

Spotnet: A spoofing-aware transformer network for effective synthetic speech detection,

Awais Khan and Khalid Malik, “Spotnet: A spoofing-aware transformer network for effective synthetic speech detection,” in 2nd ACM International Workshop on Multimedia AI against Disinformation (MAD’23), 06 2023

work page 2023
[21]

Aasist: Audio anti-spoofing using integrated spectro-temporal graph attention networks,

Jee weon Jung et al., “Aasist: Audio anti-spoofing using integrated spectro-temporal graph attention networks,” 2021

work page 2021
[22]

Improving short utterance anti-spoofing with aasist2,

Yuxiang Zhang, Jingze Lu, Zengqiang Shang, Wenchao Wang, and Pengyuan Zhang, “Improving short utterance anti-spoofing with aasist2,” 2024

work page 2024
[23]

Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation,

Hemlata Tak et al., “Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation,” 2022

work page 2022
[24]

Robust Audio Anti-Spoofing with Fusion-Reconstruction Learning on Multi-Order Spectrograms,

Penghui Wen et al., “Robust Audio Anti-Spoofing with Fusion-Reconstruction Learning on Multi-Order Spectrograms,” in Proc. INTERSPEECH 2023, 2023, pp. 271–275

work page 2023
[25]

Sharpness-aware minimization for efficiently improving generalization,

Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur, “Sharpness-aware minimization for efficiently improving generalization,” 2021

work page 2021
[26]

Asam: Adaptive sharpness-aware minimization for scale-invariant learning of deep neural networks,

Jungmin Kwon, Jeongseop Kim, Hyunseo Park, and In Kwon Choi, “Asam: Adaptive sharpness-aware minimization for scale-invariant learning of deep neural networks,” 2021

work page 2021
[27]

Generalized fake audio detection via deep stable learning,

Zhiyong Wang, Ruibo Fu, Zhengqi Wen, Yuankun Xie, Yukun Liu, Xiaopeng Wang, Xuefei Liu, Yongwei Li, Jianhua Tao, Yi Lu, Xin Qi, and Shuchen Shi, “Generalized fake audio detection via deep stable learning,” 2024

work page 2024
[28]

Beyond silence: Bias analysis through loss and asymmetric approach in audio anti-spoofing,

Hye jin Shim, Md Sahidullah, Jee weon Jung, Shinji Watanabe, and Tomi Kinnunen, “Beyond silence: Bias analysis through loss and asymmetric approach in audio anti-spoofing,” 2024

work page 2024
[29]

Samo: Speaker attractor multi-center one-class learning for voice anti-spoofing,

Siwen Ding, You Zhang, and Zhiyao Duan, “Samo: Speaker attractor multi-center one-class learning for voice anti-spoofing,” 2022

work page 2022
[30]

Kan: Kolmogorov-arnold networks,

Ziming Liu, Yixuan Wang, Sachin Vaidya, Fabian Ruehle, James Halverson, Marin Soljačić, Thomas Y. Hou, and Max Tegmark, “Kan: Kolmogorov-arnold networks,” 2024

work page 2024
[31]

On the Representation of continuous functions of several variables as superpositions of continuous functions of a smaller number of variables,

A. N. Kolmogorov, “On the Representation of continuous functions of several variables as superpositions of continuous functions of a smaller number of variables,” Dokl. Akad. Nauk, vol. 108, no. 2, 1956

work page 1956
[32]

On a constructive proof of Kolmogorov’s superposition theorem,

Jürgen Braun and Michael Griebel, “On a constructive proof of Kolmogorov’s superposition theorem,” Constructive approximation, vol. 30, pp. 653–675, 2009, Publisher: Springer

work page 2009
[33]

On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition,

A. N. Kolmogorov, “On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition,” in Doklady Akademii Nauk. 1957, vol. 114, pp. 953–956, Russian Academy of Sciences

work page 1957
[34]

Speaker recognition from raw waveform with sincnet,

Mirco Ravanelli and Yoshua Bengio, “Speaker recognition from raw waveform with sincnet,” 2019

work page 2019
[35]

wav2vec 2.0: A framework for self-supervised learning of speech representations,

Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” 2020

work page 2020
[36]

Attention is all you need,

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017

work page 2017
[37]

End-to-end spectro-temporal graph attention networks for speaker verification anti-spoofing and speech deepfake detection,

Hemlata Tak, Jee weon Jung, Jose Patino, Madhu Kamble, Massimiliano Todisco, and Nicholas Evans, “End-to-end spectro-temporal graph attention networks for speaker verification anti-spoofing and speech deepfake detection,” 2021

work page 2021
[38]

Focal loss for dense object detection,

Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988

work page 2017
[39]

Algorithmic foundation of deep x-risk optimization,

Tianbao Yang, “Algorithmic foundation of deep x-risk optimization,” arXiv preprint arXiv:2206.00439, 2022

work page arXiv 2022
[40]

Libauc: A deep learning library for x-risk optimization,

Zhuoning Yuan, Dixian Zhu, Zi-Hao Qiu, Gang Li, Xuanhui Wang, and Tianbao Yang, “Libauc: A deep learning library for x-risk optimization,” in 29th SIGKDD Conference on Knowledge Discovery and Data Mining, 2023

work page 2023
[41]

Spatial reconstructed local attention res2net with f0 subband for fake speech detection,

Cunhang Fan, Jun Xue, Jianhua Tao, Jiangyan Yi, Chenglong Wang, Chengshi Zheng, and Zhao Lv, “Spatial reconstructed local attention res2net with f0 subband for fake speech detection,” 2023

work page 2023
[42]

Leaf: A learnable frontend for audio classification,

Neil Zeghidour, Olivier Teboul, Félix de Chaumont Quitry, and Marco Tagliasacchi, “Leaf: A learnable frontend for audio classification,” 2021

work page 2021
[43]

Towards robust speech representation learning for thousands of languages,

William Chen et al., “Towards robust speech representation learning for thousands of languages,” arXiv preprint arXiv:2407.00837, 2024

work page arXiv 2024
[44]

Attention augmented convolutional networks,

Irwan Bello, Barret Zoph, Ashish Vaswani, Jonathon Shlens, and Quoc V. Le, “Attention augmented convolutional networks,” 2019

work page 2019
[45]

Rawboost: A raw data boosting and augmentation method applied to automatic speaker verification anti-spoofing,

Hemlata Tak, Madhu Kamble, Jose Patino, Massimiliano Todisco, and Nicholas Evans, “Rawboost: A raw data boosting and augmentation method applied to automatic speaker verification anti-spoofing,” 2021

work page 2021
[46]

Arelu: Attention-based rectified linear unit,

Dengsheng Chen, Jun Li, and Kai Xu, “Arelu: Attention-based rectified linear unit,” 2020

work page 2020
[47]

U-kan makes strong backbone for medical image segmentation and generation,

Chenxin Li et al., “U-kan makes strong backbone for medical image segmentation and generation,” 2024

work page 2024
[48]

Kolmogorov-arnold convolutions: Design principles and empirical studies,

Ivan Drokin, “Kolmogorov-arnold convolutions: Design principles and empirical studies,” 2024

work page 2024
[49]

Squeeze-and-excitation networks,

Jie Hu, Li Shen, Samuel Albanie, Gang Sun, and Enhua Wu, “Squeeze-and-excitation networks,” 2017

work page 2017
[50]

Pushing the limits of raw waveform speaker recognition,

Jee weon Jung, You Jin Kim, Hee-Soo Heo, Bong-Jin Lee, Youngki Kwon, and Joon Son Chung, “Pushing the limits of raw waveform speaker recognition,” 2022

work page 2022
[51]

Wavenet: A generative model for raw audio,

Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu, “Wavenet: A generative model for raw audio,” 2016

work page 2016
[52]

Ela: Efficient local attention for deep convolutional neural networks,

Wei Xu and Yi Wan, “Ela: Efficient local attention for deep convolutional neural networks,” 2024

work page 2024
[53]

Arcface: Additive angular margin loss for deep face recognition,

Jiankang Deng, Jia Guo, Jing Yang, Niannan Xue, Irene Kotsia, and Stefanos Zafeiriou, “Arcface: Additive angular margin loss for deep face recognition,” 2018

work page 2018
[54]

Convnext based neural network for audio anti-spoofing,

Qiaowei Ma, Jinghui Zhong, Yitao Yang, Weiheng Liu, Ying Gao, and Wing W.Y. Ng, “Convnext based neural network for audio anti-spoofing,” 2022

work page 2022

[1] [1]

Introduction Automatic Speaker Verification (ASV) systems are designed to identify speakers based on their voice characteristics. These systems have a variety of applications, including in the finan- cial sector for user authentication during transactions, in smart devices to ensure exclusive access for the owner to control their equipment, and in forensi...

work page

[2] [2]

Preliminaries 2.1. Kolmogorov-Arnold Network The Kolmogorov-Arnold theorem [26, 27, 28] postulates that any continuous multivariate function f : [0 , 1]n → R can be represented as a finite composition of continuous unary func- tions and a binary addition operation. More specifically: f(x) = f(x1, x2, ..., xn) = 2nX q=0 Φq 2nX p=1 ϕq,p(xp), (1) where ϕq,p ...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[3] [3]

Description of final approaches For the Closed Condition, we considered our described AA- SIST3 model

Experiments and Results 3.1. Description of final approaches For the Closed Condition, we considered our described AA- SIST3 model. The model was constrained to accept only four seconds of audio as input, which proved insufficient for the models to achieve a deeper comprehension of the audio as a whole. To address this limitation, the audio was fed into t...

work page

[4] [4]

This progress, however, has intro- duced a corresponding vulnerability in ASV systems, neces- sitating the development of a CM system to detect synthetic voices

Conclusion The rapid development of various deep learning algorithms has created new opportunities for generating synthetic audio us- ing TTS and VC systems. This progress, however, has intro- duced a corresponding vulnerability in ASV systems, neces- sitating the development of a CM system to detect synthetic voices. In this paper, we proposed a novel ar...

work page

[5] [5]

3.1 AASIST3 0.2657 sec

Classic AASIST 0.5671 sec. 3.1 AASIST3 0.2657 sec. 3.2 Leaf instead of SincConv 0.52 sec. 3.2 F0 subband + SincConv 0.4406 sec. 3.2 F0 subband instead of SincConv 0.4225 sec. 3.2 SincConv + CQT 0.4083 sec. 3.3 AttentionAugmentedConv2d in encoder 0.4762 sec. 3.3 Augmentations without pre-emphasis 0.4495 sec. 3.3 All augmentations 0.3624 sec. 3.4 AReLU inst...

work page

[6] [6]

Spoofing and countermeasures for automatic speaker verification,

Nicholas Evans, Tomi Kinnunen, and Junichi Yamagishi, “Spoofing and countermeasures for automatic speaker verification,” in INTERSPEECH 2013, 14th Annual Con- ference of the International Speech Communication As- sociation, August 25-29, 2013, Lyon, France , ISCA, Ed., Lyon, 2013

work page 2013

[7] [7]

Asvspoof 2015: the first automatic speaker verification spoofing and countermeasures chal- lenge,

Zhizheng Wu et al., “Asvspoof 2015: the first automatic speaker verification spoofing and countermeasures chal- lenge,” in INTERSPEECH 2015 , ISCA, Ed., Dresden, 2015

work page 2015

[8] [8]

The ASVspoof 2017 Challenge: Assessing the Limits of Replay Spoofing Attack Detection,

Tomi Kinnunen, Md. Sahidullah, H ´ector Delgado, et al., “The ASVspoof 2017 Challenge: Assessing the Limits of Replay Spoofing Attack Detection,” in Proc. Interspeech 2017, 2017, pp. 2–6

work page 2017

[9] [9]

Asvspoof 2019: A large-scale public database of synthesized, converted and replayed speech,

Xin Wang, Junichi Yamagishi, Massimiliano Todisco, et al., “Asvspoof 2019: A large-scale public database of synthesized, converted and replayed speech,” 2020

work page 2019

[10] [10]

Asvspoof 2021: Towards spoofed and deepfake speech detection in the wild,

Xuechen Liu et al., “Asvspoof 2021: Towards spoofed and deepfake speech detection in the wild,”IEEE/ACM Trans- actions on Audio, Speech, and Language Processing, vol. 31, pp. 2507–2522, 2023

work page 2021

[11] Yongyi Zang, You Zhang, Mojtaba Heydari, and Zhiyao Duan, “Singfake: Singing voice deepfake detection,” 2024

[12] Galina Lavrentyeva, Sergey Novoselov, Andzhukaev Tseren, Marina Volkova, Artem Gorlanov, and Alexandr Kozlov, “STC Antispoofing Systems for the ASVspoof2019 Challenge,” in Proc. Interspeech 2019, 2019, pp. 1033–1037

[13] Sunmook Choi, Il-Youp Kwak, and Seungsang Oh, “Overlapped Frequency-Distributed Network: Frequency-Aware Voice Spoofing Countermeasure,” in Proc. Interspeech 2022, 2022, pp. 3558–3562

[14] Moustafa Alzantot, Ziqi Wang, and Mani B. Srivastava, “Deep Residual Neural Networks for Audio Spoofing Detection,” in Proc. Interspeech 2019, 2019, pp. 1078–1082

[15] Cheng-I Lai, Nanxin Chen, Jesús Villalba, and Najim Dehak, “ASSERT: Anti-Spoofing with Squeeze-Excitation and Residual Networks,” in Proc. Interspeech 2019, 2019, pp. 1013–1017

[16] Diego Castan et al., “Speaker-Targeted Synthetic Speech Detection,” in Proc. The Speaker and Language Recognition Workshop (Odyssey 2022), 2022, pp. 62–69

[17] Il-Youp Kwak et al., “Voice spoofing detection through residual network, max feature map, and depthwise separable convolution,” IEEE Access, vol. PP, pp. 1–1, 01 2023

[18] Xinhui Chen, You Zhang, Ge Zhu, and Zhiyao Duan, “UR Channel-Robust Synthetic Speech Detection System for ASVspoof 2021,” in Proc. 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge, 2021, pp. 75–82

[19] Lei Wu and Ye Jiang, “Attentional fusion tdnn for spoof speech detection,” in 2022 5th International Conference on Pattern Recognition and Artificial Intelligence (PRAI), 2022, pp. 651–657

[20] Awais Khan and Khalid Malik, “Spotnet: A spoofing-aware transformer network for effective synthetic speech detection,” in 2nd ACM International Workshop on Multimedia AI against Disinformation (MAD’23), 06 2023

[21] Jee weon Jung et al., “Aasist: Audio anti-spoofing using integrated spectro-temporal graph attention networks,” 2021

[22] Yuxiang Zhang, Jingze Lu, Zengqiang Shang, Wenchao Wang, and Pengyuan Zhang, “Improving short utterance anti-spoofing with aasist2,” 2024

[23] Hemlata Tak et al., “Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation,” 2022

[24] Penghui Wen et al., “Robust Audio Anti-Spoofing with Fusion-Reconstruction Learning on Multi-Order Spectrograms,” in Proc. INTERSPEECH 2023, 2023, pp. 271–275

[25] Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur, “Sharpness-aware minimization for efficiently improving generalization,” 2021

[26] Jungmin Kwon, Jeongseop Kim, Hyunseo Park, and In Kwon Choi, “Asam: Adaptive sharpness-aware minimization for scale-invariant learning of deep neural networks,” 2021

[27] Zhiyong Wang, Ruibo Fu, Zhengqi Wen, Yuankun Xie, Yukun Liu, Xiaopeng Wang, Xuefei Liu, Yongwei Li, Jianhua Tao, Yi Lu, Xin Qi, and Shuchen Shi, “Generalized fake audio detection via deep stable learning,” 2024

[28] Hye jin Shim, Md Sahidullah, Jee weon Jung, Shinji Watanabe, and Tomi Kinnunen, “Beyond silence: Bias analysis through loss and asymmetric approach in audio anti-spoofing,” 2024

[29] Siwen Ding, You Zhang, and Zhiyao Duan, “Samo: Speaker attractor multi-center one-class learning for voice anti-spoofing,” 2022

[30] Ziming Liu, Yixuan Wang, Sachin Vaidya, Fabian Ruehle, James Halverson, Marin Soljačić, Thomas Y. Hou, and Max Tegmark, “Kan: Kolmogorov-arnold networks,” 2024

[31] A. N. Kolmogorov, “On the Representation of continuous functions of several variables as superpositions of continuous functions of a smaller number of variables,” Dokl. Akad. Nauk, vol. 108, no. 2, 1956

[32] Jürgen Braun and Michael Griebel, “On a constructive proof of Kolmogorov’s superposition theorem,” Constructive approximation, vol. 30, pp. 653–675, 2009, Publisher: Springer

[33] A. N. Kolmogorov, “On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition,” in Doklady Akademii Nauk, 1957, vol. 114, pp. 953–956, Russian Academy of Sciences

[34] Mirco Ravanelli and Yoshua Bengio, “Speaker recognition from raw waveform with sincnet,” 2019

[35] Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” 2020

[36] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017

[37] Hemlata Tak, Jee weon Jung, Jose Patino, Madhu Kamble, Massimiliano Todisco, and Nicholas Evans, “End-to-end spectro-temporal graph attention networks for speaker verification anti-spoofing and speech deepfake detection,” 2021

[38] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988

[39] Tianbao Yang, “Algorithmic foundation of deep x-risk optimization,” arXiv preprint arXiv:2206.00439, 2022

[40] Zhuoning Yuan, Dixian Zhu, Zi-Hao Qiu, Gang Li, Xuanhui Wang, and Tianbao Yang, “Libauc: A deep learning library for x-risk optimization,” in 29th SIGKDD Conference on Knowledge Discovery and Data Mining, 2023

[41] Cunhang Fan, Jun Xue, Jianhua Tao, Jiangyan Yi, Chenglong Wang, Chengshi Zheng, and Zhao Lv, “Spatial reconstructed local attention res2net with f0 subband for fake speech detection,” 2023

[42] Neil Zeghidour, Olivier Teboul, Félix de Chaumont Quitry, and Marco Tagliasacchi, “Leaf: A learnable frontend for audio classification,” 2021

[43] William Chen et al., “Towards robust speech representation learning for thousands of languages,” arXiv preprint arXiv:2407.00837, 2024

[44] Irwan Bello, Barret Zoph, Ashish Vaswani, Jonathon Shlens, and Quoc V. Le, “Attention augmented convolutional networks,” 2019

[45] Hemlata Tak, Madhu Kamble, Jose Patino, Massimiliano Todisco, and Nicholas Evans, “Rawboost: A raw data boosting and augmentation method applied to automatic speaker verification anti-spoofing,” 2021

[46] Dengsheng Chen, Jun Li, and Kai Xu, “Arelu: Attention-based rectified linear unit,” 2020

[47] Chenxin Li et al., “U-kan makes strong backbone for medical image segmentation and generation,” 2024

[48] Ivan Drokin, “Kolmogorov-arnold convolutions: Design principles and empirical studies,” 2024

[49] Jie Hu, Li Shen, Samuel Albanie, Gang Sun, and Enhua Wu, “Squeeze-and-excitation networks,” 2017

[50] Jee weon Jung, You Jin Kim, Hee-Soo Heo, Bong-Jin Lee, Youngki Kwon, and Joon Son Chung, “Pushing the limits of raw waveform speaker recognition,” 2022

[51] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu, “Wavenet: A generative model for raw audio,” 2016

[52] Wei Xu and Yi Wan, “Ela: Efficient local attention for deep convolutional neural networks,” 2024

[53] Jiankang Deng, Jia Guo, Jing Yang, Niannan Xue, Irene Kotsia, and Stefanos Zafeiriou, “Arcface: Additive angular margin loss for deep face recognition,” 2018

[54] Qiaowei Ma, Jinghui Zhong, Yitao Yang, Weiheng Liu, Ying Gao, and Wing W.Y. Ng, “Convnext based neural network for audio anti-spoofing,” 2022