AASIST3: KAN-Enhanced AASIST Speech Deepfake Detection using SSL Features and Additional Regularization for the ASVspoof 2024 Challenge
Pith reviewed 2026-05-23 21:30 UTC · model grok-4.3
The pith
Enhancing AASIST with Kolmogorov-Arnold networks and additional components more than doubles performance on speech deepfake detection.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By integrating Kolmogorov-Arnold networks into the AASIST framework along with additional layers, encoders, pre-emphasis techniques, and SSL features, AASIST3 achieves minDCF scores of 0.5357 in the closed condition and 0.1414 in the open condition on the ASVspoof 2024 Challenge, representing more than a twofold improvement in deepfake speech detection.
What carries the argument
The AASIST3 architecture that augments AASIST with Kolmogorov-Arnold networks for feature transformation plus regularization layers and pre-emphasis.
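For readers unfamiliar with the building block named above, the following is a minimal sketch of the Kolmogorov-Arnold form that KANs are modelled on: each output is a sum of learnable univariate functions, one per input. It is not the AASIST3 implementation. The class names `PiecewiseLinear` and `ToyKANLayer`, the piecewise-linear parameterization (the KAN paper uses B-splines), and the knot count are all illustrative assumptions.

```python
import random


class PiecewiseLinear:
    """Learnable 1-D function on [0, 1], stored as values at evenly spaced knots."""

    def __init__(self, num_knots, rng):
        self.values = [rng.uniform(-1.0, 1.0) for _ in range(num_knots)]

    def __call__(self, x):
        x = min(max(x, 0.0), 1.0)              # clamp to the grid's domain
        pos = x * (len(self.values) - 1)       # fractional knot index
        lo = min(int(pos), len(self.values) - 2)
        frac = pos - lo
        return (1.0 - frac) * self.values[lo] + frac * self.values[lo + 1]


class ToyKANLayer:
    """out[q] = sum_p phi[q][p](x[p]): one univariate function per edge."""

    def __init__(self, in_dim, out_dim, num_knots=8, seed=0):
        rng = random.Random(seed)
        self.phi = [[PiecewiseLinear(num_knots, rng) for _ in range(in_dim)]
                    for _ in range(out_dim)]

    def __call__(self, x):
        return [sum(f(xp) for f, xp in zip(row, x)) for row in self.phi]


# Hypothetical toy input with three features in [0, 1].
layer = ToyKANLayer(in_dim=3, out_dim=2)
out = layer([0.1, 0.5, 0.9])
```

Unlike a linear layer followed by a fixed activation, the nonlinearity here sits on every edge, and the knot values are what training would adjust.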
If this is right
- Detection of synthetic voices improves by more than a factor of two in both closed and open conditions.
- Automatic speaker verification systems gain stronger resistance to TTS and VC attacks.
- The updated model becomes publicly available for deployment in authentication and forensic tasks.
- The combination of SSL features with the new components supports better handling of unseen synthesis methods.
Where Pith is reading between the lines
- Similar Kolmogorov-Arnold network additions could be tested on other audio classification tasks such as speaker identification.
- The approach may lower the amount of labeled data needed for training robust deepfake detectors if SSL features prove complementary.
- Reproducing the gains on later ASVspoof editions or independent datasets would indicate whether the changes generalize beyond the 2024 challenge.
Load-bearing premise
The performance gains come from the listed architectural changes rather than differences in training data, optimization, or evaluation protocol.
What would settle it
A side-by-side run of the original AASIST and AASIST3 on identical training data and the same ASVspoof 2024 evaluation protocol that fails to show at least a twofold reduction in minDCF.
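As a quick arithmetic check of what "twofold" means in this test, the sketch below computes the reduction factor between two lower-is-better scores. The helper name `reduction_factor` is hypothetical. The two inputs are the Classic AASIST and AASIST3 figures quoted in the paper's results excerpt in the reference graph on this page, and it is an assumption that both were measured under the same protocol.

```python
def reduction_factor(baseline, candidate):
    """Return how many times smaller `candidate` is than `baseline` (lower is better)."""
    if baseline <= 0 or candidate <= 0:
        raise ValueError("scores must be positive")
    return baseline / candidate


classic_aasist = 0.5671   # Classic AASIST, from the paper's results excerpt
aasist3 = 0.2657          # AASIST3, from the paper's results excerpt
factor = reduction_factor(classic_aasist, aasist3)
meets_twofold = factor >= 2.0
print(f"reduction factor: {factor:.2f}, twofold: {meets_twofold}")
```

A matched-protocol rerun would pass or fail the settling test according to the same comparison against 2.0.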
Original abstract
Automatic Speaker Verification (ASV) systems, which identify speakers based on their voice characteristics, have numerous applications, such as user authentication in financial transactions, exclusive access control in smart devices, and forensic fraud detection. However, the advancement of deep learning algorithms has enabled the generation of synthetic audio through Text-to-Speech (TTS) and Voice Conversion (VC) systems, exposing ASV systems to potential vulnerabilities. To counteract this, we propose a novel architecture named AASIST3. By enhancing the existing AASIST framework with Kolmogorov-Arnold networks, additional layers, encoders, and pre-emphasis techniques, AASIST3 achieves a more than twofold improvement in performance. It demonstrates minDCF results of 0.5357 in the closed condition and 0.1414 in the open condition, significantly enhancing the detection of synthetic voices and improving ASV security. The new version of the model is publicly available at HuggingFace (2026): https://huggingface.co/lab260/Spectra-AASIST3
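Of the components the abstract lists, pre-emphasis is the simplest to make concrete: a first-order high-pass filter y[n] = x[n] - alpha * x[n-1] applied to the waveform before feature extraction. The sketch below is a generic illustration, not the paper's code. The function name is hypothetical, and the coefficient 0.97 is a common default assumed here because the abstract does not state the value used.

```python
def pre_emphasis(samples, alpha=0.97):
    """Apply y[n] = x[n] - alpha * x[n-1]; the first sample passes through unchanged."""
    if not samples:
        return []
    out = [samples[0]]
    for prev, cur in zip(samples, samples[1:]):
        out.append(cur - alpha * prev)
    return out


# A constant (zero-frequency) signal is almost entirely suppressed after the
# first sample, while a sign-alternating (highest-frequency) signal is amplified.
flat = pre_emphasis([1.0, 1.0, 1.0, 1.0])
alternating = pre_emphasis([1.0, -1.0, 1.0, -1.0])
```

The contrast between the two toy signals shows why the filter is described as emphasizing high-frequency content.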
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes AASIST3, an enhancement of the AASIST architecture for speech deepfake detection on the ASVspoof 2024 Challenge. It augments the base model with Kolmogorov-Arnold Networks (KAN), additional layers, encoders, pre-emphasis, SSL features, and extra regularization, reporting minDCF scores of 0.5357 (closed condition) and 0.1414 (open condition) and claiming more than twofold improvement. The updated model is released publicly on Hugging Face.
Significance. If the reported gains can be shown to arise from the listed architectural changes rather than uncontrolled differences in training or evaluation, the result would strengthen practical defenses against synthetic speech in ASV systems. The public model release would also aid reproducibility.
major comments (2)
- [Abstract] The central claim of 'more than twofold improvement' is load-bearing yet unsupported; the abstract supplies neither the original AASIST minDCF under matched conditions nor any ablation isolating KAN, extra layers, encoders, or pre-emphasis from possible differences in data, optimizer, or protocol.
- [Abstract] No methodology details, baselines, error bars, or statistical significance tests accompany the reported minDCF values (0.5357 closed / 0.1414 open), preventing assessment of whether the numbers reliably support the improvement claim.
minor comments (1)
- [Abstract] The Hugging Face link contains '(2026)', which appears to be a typographical error given the 2024 arXiv date.
Simulated Author's Rebuttal
We thank the referee for the comments. We address each major comment below and will revise the abstract accordingly.
- Referee: [Abstract] The central claim of 'more than twofold improvement' is load-bearing yet unsupported; the abstract supplies neither the original AASIST minDCF under matched conditions nor any ablation isolating KAN, extra layers, encoders, or pre-emphasis from possible differences in data, optimizer, or protocol.
  Authors: We agree the abstract should support the claim with the original AASIST minDCF under matched ASVspoof 2024 conditions. The revised abstract will report that baseline value and reference an ablation study (to be added in the main text) isolating the contributions of KAN, additional layers, encoders, and pre-emphasis. This will demonstrate that gains stem from the listed changes rather than training or protocol differences.
  Revision: yes
- Referee: [Abstract] No methodology details, baselines, error bars, or statistical significance tests accompany the reported minDCF values (0.5357 closed / 0.1414 open), preventing assessment of whether the numbers reliably support the improvement claim.
  Authors: We acknowledge these elements are absent from the abstract. The revision will add concise methodology details and baselines to the abstract. Error bars and statistical tests will be included if multiple runs exist in our experiments; otherwise we will note the single-run results and retain the reported point estimates.
  Revision: partial
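One way the promised error bars could be produced, if several training runs exist, is a percentile bootstrap over per-run scores. This is a sketch under stated assumptions: the function name `bootstrap_ci` is hypothetical, and the five run scores are placeholders for illustration only, not results from the paper.

```python
import random
import statistics


def bootstrap_ci(scores, num_resamples=2000, confidence=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)
    # Resample the runs with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=len(scores)))
        for _ in range(num_resamples)
    )
    tail = (1.0 - confidence) / 2.0
    lower = means[round(tail * num_resamples)]
    upper = means[round((1.0 - tail) * num_resamples) - 1]
    return lower, upper


# Placeholder per-seed scores (NOT from the paper); lower is better.
run_scores = [0.27, 0.26, 0.28, 0.25, 0.27]
low, high = bootstrap_ci(run_scores)
mean_score = statistics.fmean(run_scores)
```

With only a single run available, no such interval can be formed, which is the case the authors say they would flag explicitly.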
Circularity Check
No circularity: purely empirical architecture tweak with no derivations or self-referential predictions
The paper presents an empirical model enhancement (AASIST3) by adding KAN, layers, encoders, and pre-emphasis to AASIST, then reports minDCF numbers on ASVspoof 2024. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claim is a performance delta from architectural changes; while the skeptic correctly notes missing ablations, this is an attribution gap rather than circularity. The work is self-contained against external benchmarks and contains no self-definitional steps or reductions by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The ASVspoof 2024 evaluation protocol and metrics (minDCF) provide a fair measure of real-world deepfake detection capability.
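Because this ledger rests on minDCF, a sketch of how a minimum normalized detection cost is typically computed may help: sweep a decision threshold over the scores, compute the miss and false-alarm rates at each threshold, and keep the lowest normalized cost. The cost parameters below (`c_miss=1`, `c_fa=10`, `p_spoof=0.05`) are assumptions for illustration and should be checked against the official ASVspoof 2024 evaluation plan. The function name and the toy scores are hypothetical.

```python
def min_dcf(bonafide_scores, spoof_scores, c_miss=1.0, c_fa=10.0, p_spoof=0.05):
    """Minimum normalized detection cost; higher scores mean 'more likely bona fide'."""
    # Cost of the best trivial system (accept everything or reject everything).
    default_cost = min(c_miss * (1.0 - p_spoof), c_fa * p_spoof)
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    thresholds.append(float("inf"))          # threshold that rejects every trial
    best = float("inf")
    for t in thresholds:
        # A trial is accepted as bona fide when its score is >= t.
        p_miss = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        p_fa = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        cost = c_miss * (1.0 - p_spoof) * p_miss + c_fa * p_spoof * p_fa
        best = min(best, cost / default_cost)
    return best


# Toy scores: perfectly separated classes give zero cost, swapped classes do not.
perfect = min_dcf([0.9, 0.8, 0.7], [0.3, 0.2, 0.1])
swapped = min_dcf([0.3, 0.2, 0.1], [0.9, 0.8, 0.7])
```

Because the cost is normalized by the best trivial system, a detector that cannot beat "accept all" or "reject all" scores 1.0, which is the reference point for reading values such as 0.5357 and 0.1414.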
Forward citations
Cited by 1 Pith paper
- Interpreting Multi-Branch Anti-Spoofing Architectures: Correlating Internal Strategy with Empirical Performance
A framework using covariance-based spectral signatures and TreeSHAP attributions on AASIST3 branches identifies four operational archetypes and a flawed specialization mode that explains high error rates on specific s...
Reference graph
Works this paper leans on
- [1] Introduction. Automatic Speaker Verification (ASV) systems are designed to identify speakers based on their voice characteristics. These systems have a variety of applications, including in the financial sector for user authentication during transactions, in smart devices to ensure exclusive access for the owner to control their equipment, and in forensi...
- [2] Preliminaries. 2.1. Kolmogorov-Arnold Network. The Kolmogorov-Arnold theorem [26, 27, 28] states that any continuous multivariate function $f : [0, 1]^n \to \mathbb{R}$ can be represented as a finite composition of continuous unary functions and a binary addition operation. More specifically: $f(x) = f(x_1, x_2, \ldots, x_n) = \sum_{q=0}^{2n} \Phi_q\left(\sum_{p=1}^{n} \phi_{q,p}(x_p)\right)$ (1), where $\phi_{q,p}$ ...
- [3] Experiments and Results. 3.1. Description of final approaches. For the Closed Condition, we considered our described AASIST3 model. The model was constrained to accept only four seconds of audio as input, which proved insufficient for the models to achieve a deeper comprehension of the audio as a whole. To address this limitation, the audio was fed into t...
- [4] Conclusion. The rapid development of various deep learning algorithms has created new opportunities for generating synthetic audio using TTS and VC systems. This progress, however, has introduced a corresponding vulnerability in ASV systems, necessitating the development of a CM system to detect synthetic voices. In this paper, we proposed a novel ar...
- [5] Classic AASIST 0.5671 sec. 3.1 AASIST3 0.2657 sec. 3.2 Leaf instead of SincConv 0.52 sec. 3.2 F0 subband + SincConv 0.4406 sec. 3.2 F0 subband instead of SincConv 0.4225 sec. 3.2 SincConv + CQT 0.4083 sec. 3.3 AttentionAugmentedConv2d in encoder 0.4762 sec. 3.3 Augmentations without pre-emphasis 0.4495 sec. 3.3 All augmentations 0.3624 sec. 3.4 AReLU inst...
- [6] Nicholas Evans, Tomi Kinnunen, and Junichi Yamagishi, “Spoofing and countermeasures for automatic speaker verification,” in INTERSPEECH 2013, 14th Annual Conference of the International Speech Communication Association, August 25-29, 2013, Lyon, France, ISCA, Ed., Lyon, 2013
- [7] Zhizheng Wu et al., “Asvspoof 2015: the first automatic speaker verification spoofing and countermeasures challenge,” in INTERSPEECH 2015, ISCA, Ed., Dresden, 2015
- [8] Tomi Kinnunen, Md. Sahidullah, Héctor Delgado, et al., “The ASVspoof 2017 Challenge: Assessing the Limits of Replay Spoofing Attack Detection,” in Proc. Interspeech 2017, 2017, pp. 2–6
- [9] Xin Wang, Junichi Yamagishi, Massimiliano Todisco, et al., “Asvspoof 2019: A large-scale public database of synthesized, converted and replayed speech,” 2020
- [10] Xuechen Liu et al., “Asvspoof 2021: Towards spoofed and deepfake speech detection in the wild,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2507–2522, 2023
- [11] Yongyi Zang, You Zhang, Mojtaba Heydari, and Zhiyao Duan, “Singfake: Singing voice deepfake detection,” 2024
- [12] Galina Lavrentyeva, Sergey Novoselov, Andzhukaev Tseren, Marina Volkova, Artem Gorlanov, and Alexandr Kozlov, “STC Antispoofing Systems for the ASVspoof2019 Challenge,” in Proc. Interspeech 2019, 2019, pp. 1033–1037
- [13] Sunmook Choi, Il-Youp Kwak, and Seungsang Oh, “Overlapped Frequency-Distributed Network: Frequency-Aware Voice Spoofing Countermeasure,” in Proc. Interspeech 2022, 2022, pp. 3558–3562
- [14] Moustafa Alzantot, Ziqi Wang, and Mani B. Srivastava, “Deep Residual Neural Networks for Audio Spoofing Detection,” in Proc. Interspeech 2019, 2019, pp. 1078–1082
- [15] Cheng-I Lai, Nanxin Chen, Jesús Villalba, and Najim Dehak, “ASSERT: Anti-Spoofing with Squeeze-Excitation and Residual Networks,” in Proc. Interspeech 2019, 2019, pp. 1013–1017
- [16] Diego Castan et al., “Speaker-Targeted Synthetic Speech Detection,” in Proc. The Speaker and Language Recognition Workshop (Odyssey 2022), 2022, pp. 62–69
- [17] Il-Youp Kwak et al., “Voice spoofing detection through residual network, max feature map, and depthwise separable convolution,” IEEE Access, vol. PP, pp. 1–1, 01 2023
- [18] Xinhui Chen, You Zhang, Ge Zhu, and Zhiyao Duan, “UR Channel-Robust Synthetic Speech Detection System for ASVspoof 2021,” in Proc. 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge, 2021, pp. 75–82
- [19] Lei Wu and Ye Jiang, “Attentional fusion tdnn for spoof speech detection,” in 2022 5th International Conference on Pattern Recognition and Artificial Intelligence (PRAI), 2022, pp. 651–657
- [20] Awais Khan and Khalid Malik, “Spotnet: A spoofing-aware transformer network for effective synthetic speech detection,” in 2nd ACM International Workshop on Multimedia AI against Disinformation (MAD’23), 06 2023
- [21] Jee-weon Jung et al., “Aasist: Audio anti-spoofing using integrated spectro-temporal graph attention networks,” 2021
- [22] Yuxiang Zhang, Jingze Lu, Zengqiang Shang, Wenchao Wang, and Pengyuan Zhang, “Improving short utterance anti-spoofing with aasist2,” 2024
- [23] Hemlata Tak et al., “Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation,” 2022
- [24] Penghui Wen et al., “Robust Audio Anti-Spoofing with Fusion-Reconstruction Learning on Multi-Order Spectrograms,” in Proc. INTERSPEECH 2023, 2023, pp. 271–275
- [25] Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur, “Sharpness-aware minimization for efficiently improving generalization,” 2021
- [26] Jungmin Kwon, Jeongseop Kim, Hyunseo Park, and In Kwon Choi, “Asam: Adaptive sharpness-aware minimization for scale-invariant learning of deep neural networks,” 2021
- [27] Zhiyong Wang, Ruibo Fu, Zhengqi Wen, Yuankun Xie, Yukun Liu, Xiaopeng Wang, Xuefei Liu, Yongwei Li, Jianhua Tao, Yi Lu, Xin Qi, and Shuchen Shi, “Generalized fake audio detection via deep stable learning,” 2024
- [28] Hye-jin Shim, Md Sahidullah, Jee-weon Jung, Shinji Watanabe, and Tomi Kinnunen, “Beyond silence: Bias analysis through loss and asymmetric approach in audio anti-spoofing,” 2024
- [29] Siwen Ding, You Zhang, and Zhiyao Duan, “Samo: Speaker attractor multi-center one-class learning for voice anti-spoofing,” 2022
- [30] Ziming Liu, Yixuan Wang, Sachin Vaidya, Fabian Ruehle, James Halverson, Marin Soljačić, Thomas Y. Hou, and Max Tegmark, “Kan: Kolmogorov-arnold networks,” 2024
- [31] A. N. Kolmogorov, “On the Representation of continuous functions of several variables as superpositions of continuous functions of a smaller number of variables,” Dokl. Akad. Nauk, vol. 108, no. 2, 1956
- [32] Jürgen Braun and Michael Griebel, “On a constructive proof of Kolmogorov’s superposition theorem,” Constructive approximation, vol. 30, pp. 653–675, 2009, Publisher: Springer
- [33] A. N. Kolmogorov, “On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition,” in Doklady Akademii Nauk, 1957, vol. 114, pp. 953–956, Russian Academy of Sciences
- [34] Mirco Ravanelli and Yoshua Bengio, “Speaker recognition from raw waveform with sincnet,” 2019
- [35] Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” 2020
- [36] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017
- [37] Hemlata Tak, Jee-weon Jung, Jose Patino, Madhu Kamble, Massimiliano Todisco, and Nicholas Evans, “End-to-end spectro-temporal graph attention networks for speaker verification anti-spoofing and speech deepfake detection,” 2021
- [38] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988
- [39] Tianbao Yang, “Algorithmic foundation of deep x-risk optimization,” arXiv preprint arXiv:2206.00439, 2022
- [40] Zhuoning Yuan, Dixian Zhu, Zi-Hao Qiu, Gang Li, Xuanhui Wang, and Tianbao Yang, “Libauc: A deep learning library for x-risk optimization,” in 29th SIGKDD Conference on Knowledge Discovery and Data Mining, 2023
- [41] Cunhang Fan, Jun Xue, Jianhua Tao, Jiangyan Yi, Chenglong Wang, Chengshi Zheng, and Zhao Lv, “Spatial reconstructed local attention res2net with f0 subband for fake speech detection,” 2023
- [42] Neil Zeghidour, Olivier Teboul, Félix de Chaumont Quitry, and Marco Tagliasacchi, “Leaf: A learnable frontend for audio classification,” 2021
- [43] William Chen et al., “Towards robust speech representation learning for thousands of languages,” arXiv preprint arXiv:2407.00837, 2024
- [44] Irwan Bello, Barret Zoph, Ashish Vaswani, Jonathon Shlens, and Quoc V. Le, “Attention augmented convolutional networks,” 2019
- [45] Hemlata Tak, Madhu Kamble, Jose Patino, Massimiliano Todisco, and Nicholas Evans, “Rawboost: A raw data boosting and augmentation method applied to automatic speaker verification anti-spoofing,” 2021
- [46] Dengsheng Chen, Jun Li, and Kai Xu, “Arelu: Attention-based rectified linear unit,” 2020
- [47] Chenxin Li et al., “U-kan makes strong backbone for medical image segmentation and generation,” 2024
- [48] Ivan Drokin, “Kolmogorov-arnold convolutions: Design principles and empirical studies,” 2024
- [49] Jie Hu, Li Shen, Samuel Albanie, Gang Sun, and Enhua Wu, “Squeeze-and-excitation networks,” 2017
- [50] Jee-weon Jung, You Jin Kim, Hee-Soo Heo, Bong-Jin Lee, Youngki Kwon, and Joon Son Chung, “Pushing the limits of raw waveform speaker recognition,” 2022
- [51] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu, “Wavenet: A generative model for raw audio,” 2016
- [52] Wei Xu and Yi Wan, “Ela: Efficient local attention for deep convolutional neural networks,” 2024
- [53] Jiankang Deng, Jia Guo, Jing Yang, Niannan Xue, Irene Kotsia, and Stefanos Zafeiriou, “Arcface: Additive angular margin loss for deep face recognition,” 2018
- [54] Qiaowei Ma, Jinghui Zhong, Yitao Yang, Weiheng Liu, Ying Gao, and Wing W. Y. Ng, “Convnext based neural network for audio anti-spoofing,” 2022