Audio Deepfake Detection with Half-Truth Localisation Using Cross-Attentive Feature Fusion
Pith reviewed 2026-06-29 06:00 UTC · model grok-4.3
The pith
A 576k-parameter model jointly classifies real, fully fake, and half-truth audio while regressing the boundaries of spliced segments.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CAFNet fuses MFCC, LFCC, and Chroma-STFT features through parallel depthwise-separable convolution branches with cross-attention, followed by a BiLSTM regression head for boundary prediction. On the combined MLADDC T2+T3 test set it achieves 92.71 percent accuracy and macro AUC of 0.9910 for ternary classification, boundary localisation MAE of 0.075 seconds, and on binary detection 96.76 percent accuracy with 3.20 percent EER, outperforming fine-tuned XLS-R 300M and AST 87M at over 500 times fewer parameters.
What carries the argument
Cross-attentive feature fusion: parallel depthwise-separable convolution branches on MFCC, LFCC, and Chroma-STFT inputs with cross-attention between branches, plus BiLSTM regression head for joint classification and boundary regression.
If this is right
- Joint ternary classification and boundary regression can be performed in a single forward pass without separate models for detection and localisation.
- The small parameter count enables deployment in resource-constrained settings while still outperforming much larger pre-trained models on both tasks.
- Cross-dataset evaluation reveals that standard fine-tuning of large backbones leads to collapsed representations even with reduced learning rates.
- The reported median boundary error of 0.052 seconds provides a concrete target for localisation precision in half-truth scenarios.
Where Pith is reading between the lines
- The cross-attention fusion of multiple cepstral features could be adapted to localise manipulations in other sequential signals such as video or sensor data.
- Lightweight models of this form may support real-time monitoring in applications like voice authentication or social media audio moderation.
- The ternary output and boundary regression together could support downstream tasks such as automated editing or forensic timeline reconstruction.
- Testing the architecture on datasets with multiple spliced segments or varying splice lengths would clarify the limits of the current boundary regression head.
Load-bearing premise
The MLADDC T2+T3 test partitions contain held-out half-truth examples whose generation process and acoustic conditions are representative of real-world partial manipulations and were not seen during any hyperparameter search or model selection.
What would settle it
Evaluating CAFNet on a new collection of real-world spliced audio recordings generated by different synthesis methods and recording conditions than those in MLADDC would test whether the reported accuracy and localisation error hold.
Figures
read the original abstract
Audio deepfake detection is well-studied as a binary problem, but partially manipulated speech, where a short synthesised segment is spliced into an otherwise genuine utterance, poses a harder and more realistic threat. Detecting such half-truth audio requires not only distinguishing it from real and fully fake speech, but also localising where the manipulation occurs. We present CAFNet, a 576k-parameter architecture that addresses both tasks jointly: it performs ternary classification (real, fully-fake, or half-truth) and regresses the temporal boundaries of the synthesised region in a single forward pass. CAFNet fuses Mel-Frequency Cepstral Coefficient (MFCC), Linear-Frequency Cepstral Coefficient (LFCC), and Chroma Short-Time Fourier Transform (Chroma-STFT) features through parallel depthwise-separable convolution branches with cross-attention, followed by a Bidirectional Long Short-Term Memory (BiLSTM) regression head for boundary prediction. On the combined Multi-Lingual Audio Deepfake Detection Corpus (MLADDC) T2+T3 test set, CAFNet achieves 92.71% accuracy and macro Area Under the Curve (AUC) of 0.9910, with boundary localisation Mean Absolute Error (MAE) of 0.075s and a median error of 0.052s. On binary detection, it achieves 96.76% accuracy and 3.20% Equal Error Rate (EER), outperforming fine-tuned XLS-R 300M (78.31%) and AST 87M (93.03%) at over 500 times fewer parameters. A cross-dataset study further shows that standard fine-tuning collapses cross-domain representations even under reduced backbone learning rates.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces CAFNet, a 576k-parameter model for joint ternary classification (real, fully-fake, half-truth) and temporal boundary regression on partially manipulated ('half-truth') audio. It fuses MFCC, LFCC, and Chroma-STFT features via parallel depthwise-separable convolution branches with cross-attention, followed by a BiLSTM regression head. On the combined MLADDC T2+T3 test set the model reports 92.71% ternary accuracy and 0.9910 macro AUC, 0.075 s boundary MAE (median 0.052 s), and for binary detection 96.76% accuracy with 3.20% EER, outperforming fine-tuned XLS-R 300M and AST 87M at >500× fewer parameters. A cross-dataset study on representation collapse under fine-tuning is also presented.
Significance. If the performance numbers are obtained on genuinely held-out half-truth examples whose synthesis pipelines and acoustic conditions were never seen during training or model selection, the result would be significant: it shows that a compact, multi-task architecture can simultaneously solve detection and localisation for a more realistic threat model while remaining deployable on edge devices. The explicit parameter count and joint-training formulation are concrete strengths.
major comments (2)
- [§4] §4 (Experiments), data-partition paragraph: the construction of the MLADDC T2 and T3 test partitions is not described (no statement on speaker overlap, vocoder overlap, recording conditions, or whether any half-truth examples were used in hyper-parameter search). Because the headline metrics (92.71% ternary accuracy, 0.9910 macro AUC, 3.20% EER) are only interpretable under a strict held-out regime, this omission is load-bearing for the central claim.
- [§3, §4.2] §3 (Model) and §4.2 (Training protocol): it is not stated whether boundary regression is trained jointly with the ternary classifier or post-hoc, nor which loss combination (cross-entropy + L1/L2 on boundaries) is used. The reported 0.075 s MAE cannot be assessed without this information, which directly affects the joint-task claim.
minor comments (2)
- [§4.3] The abstract and §4.3 state that CAFNet 'outperforms' XLS-R and AST but do not report the exact fine-tuning protocol (learning-rate schedule, number of epochs, data augmentation) used for the baselines; adding a short table would improve reproducibility.
- [Figure 2] Figure 2 (architecture diagram) uses an inconsistent arrow style for the cross-attention blocks; a single legend would aid clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The two major comments highlight important omissions in the experimental description that affect interpretability of the results. We address each point below and will revise the manuscript to incorporate the requested clarifications.
read point-by-point responses
-
Referee: [§4] §4 (Experiments), data-partition paragraph: the construction of the MLADDC T2 and T3 test partitions is not described (no statement on speaker overlap, vocoder overlap, recording conditions, or whether any half-truth examples were used in hyper-parameter search). Because the headline metrics (92.71% ternary accuracy, 0.9910 macro AUC, 3.20% EER) are only interpretable under a strict held-out regime, this omission is load-bearing for the central claim.
Authors: We agree that the data-partition details are insufficiently described and that this information is necessary to support the held-out evaluation claims. The MLADDC T2 and T3 partitions were constructed with no speaker overlap with the training set, distinct vocoders, and different recording conditions from the training data; no half-truth examples were included in hyper-parameter search or validation. We will expand the data-partition paragraph in §4 to explicitly document these properties, citing the original MLADDC dataset splits. revision: yes
-
Referee: [§3, §4.2] §3 (Model) and §4.2 (Training protocol): it is not stated whether boundary regression is trained jointly with the ternary classifier or post-hoc, nor which loss combination (cross-entropy + L1/L2 on boundaries) is used. The reported 0.075 s MAE cannot be assessed without this information, which directly affects the joint-task claim.
Authors: The boundary regression is trained jointly with the ternary classifier in a single multi-task objective. The combined loss is cross-entropy for the three-class classification head plus L1 loss on the boundary regression head, with equal weighting between the two terms. We will revise §3 (Model) and §4.2 (Training protocol) to state this joint training procedure and loss formulation explicitly. revision: yes
Circularity Check
No circularity; results are empirical test-set metrics with no derivation chain
full rationale
The paper reports empirical accuracies, AUC, MAE, and EER on the MLADDC T2+T3 test partitions for a 576k-parameter model CAFNet, with direct comparisons to fine-tuned XLS-R and AST. No equations, first-principles derivations, or predictions are claimed that reduce to fitted inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling appear. The work is self-contained against external benchmarks and receives the default non-finding for an empirical ML paper.
Axiom & Free-Parameter Ledger
free parameters (1)
- CAFNet weights and training hyperparameters
axioms (1)
- domain assumption MLADDC T2+T3 test examples are representative of real-world half-truth manipulations and were not used in model selection
Reference graph
Works this paper leans on
-
[1]
J. Kong, J. Kim, J. Bae, HiFi-GAN: generative adversarial networks for efficient and high fidelity speech synthesis. Adv. Neural Inf. Process. Syst.33, 17022–17033 (2020). https://doi.org/10.48550/arXiv.2010.05646
- [2]
-
[3]
K.T. Mai, S. Bray, T. Davies, L.D. Griffin, Warning: humans cannot reliably detect speech deepfakes. PLoS ONE18(8), e0285333 (2023).https://doi.org/10.1371/journal. pone.0285333
-
[4]
Shah, R.M
A.J. Shah, R.M. Purohit, D.H. Vaghera, H.A. Patil, MLADDC: multi-lingual audio deep- fake detection corpus, inAudio Imagination: NeurIPS 2024 Workshop(2024).https: //openreview.net/forum?id=ic3HvoOTeU
2024
-
[5]
Karthik S. Krishnan, Koushik S. Krishnan, MFAAN: unveiling audio deepfakes with a multi-feature authenticity network, inProc. 9th Int. Conf. Signal Process. Commun. (ICSC), pp. 585–590 (2023).https://doi.org/10.1109/ICSC60394.2023.10441405
-
[6]
J. Khochare, C. Joshi, B. Yenarkar, S. Suratkar, F. Kazi, A deep learning framework for audio deepfake detection. Arab. J. Sci. Eng.47(3), 3447–3458 (2022).https://doi.org/ 10.1007/s13369-021-06297-w
-
[7]
A. Hamza, A.R. Javed, F. Iqbal, N. Kryvinska, A.S. Almadhor, Z. Jalil, R. Borghol, Deepfake audio detection via MFCC features using machine learning. IEEE Access10, 134018–134028 (2022).https://doi.org/10.1109/ACCESS.2022.3231480
-
[8]
Reimao, V
R. Reimao, V. Tzerpos, FoR: a dataset for synthetic speech detection, inProc. 2019 Int. Conf. Speech Technol. Human-Comput. Dialogue (SpeD), pp. 1–10. IEEE (2019)
2019
- [9]
-
[10]
J. Yamagishi, X. Wang, M. Todisco, M. Sahidullah, J. Patino, A. Nautsch, X. Liu, K.A. Lee, T. Kinnunen, N. Evans, H. Delgado, ASVspoof 2021: accelerating progress in spoofed and deepfake speech detection, inProc. ASVspoof 2021 Workshop, pp. 47–58 (2021).https://doi.org/10.21437/ASVSPOOF.2021-8
-
[11]
M¨ uller, P
N.M. M¨ uller, P. Czempin, F. Diekmann, A. Froghyar, K. B¨ ottinger, Does audio deepfake detection generalize? inProc. Interspeech 2022, pp. 2783–2787 (2022)
2022
-
[12]
J. Yi, Y. Bai, J. Tao, H. Ma, Z. Tian, C. Wang, T. Wang, R. Fu, Half-truth: a partially fake audio detection dataset, inProc. Interspeech 2021, pp. 1654–1658 (2021).https: //doi.org/10.21437/Interspeech.2021-930
-
[13]
A. Babu, C. Wang, A. Tjandra, K. Lakhotia, Q. Xu, N. Goyal, K. Singh, P. von Platen, Y. Saraf, J. Pino, A. Baevski, A. Conneau, M. Auli, XLS-R: self-supervised cross-lingual speech representation learning at scale, inProc. Interspeech 2022, pp. 2278–2282 (2022)
2022
-
[14]
Y. Gong, Y.-A. Chung, J. Glass, AST: audio spectrogram transformer, inProc. Interspeech 2021, pp. 571–575 (2021).https://doi.org/10.21437/Interspeech.2021-698
-
[15]
A. Baevski, Y. Zhou, A. Mohamed, M. Auli, wav2vec 2.0: a framework for self-supervised learning of speech representations. Adv. Neural Inf. Process. Syst.33, 12449–12460 (2020). https://doi.org/10.48550/arXiv.2006.11477
-
[16]
Parisi, R
G.I. Parisi, R. Kemker, J.L. Part, C. Kanan, S. Wermter, Continual lifelong learning with neural networks: a review. Neural Netw.113, 54–71 (2019).https://doi.org/10.1016/ j.neunet.2019.01.012
2019
-
[17]
J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A.A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, D. Hassabis, C. Clopath, D. Kumaran, R. Hadsell, Overcoming catastrophic forgetting in neural networks. Proc. Natl. Acad. Sci. 114(13), 3521–3526 (2017).https://doi.org/10.1073/pnas.1611835114
-
[18]
Z. Cai, M. Li, Integrating frame-level boundary detection and deepfake detection for locat- ing manipulated regions in partially spoofed audio forgery attacks. Comput. Speech Lang. 85, 101597 (2024).https://doi.org/10.1016/j.csl.2023.101597 13
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.