arxiv: 2606.10908 · v1 · pith:WELB3CLXnew · submitted 2026-06-09 · cs.SD · cs.AI · cs.CR · cs.LG

RAT: Reference-Augmented Training for ASV Anti-Spoofing

Pith reviewed 2026-06-27 11:40 UTC · model grok-4.3

classification cs.SD · cs.AI · cs.CR · cs.LG
keywords anti-spoofing · ASVspoof · reference-augmented training · deepfake detection · speaker verification · invariance · countermeasure

The pith

Training anti-spoofing detectors with speaker reference recordings improves performance even when the reference is absent or ignored at inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an architecture that conditions spoof detection on additional speaker-reference recordings during training. It finds that the model quickly learns to ignore the reference channel, yet training with that channel induces invariance that improves deepfake detection. The benefit persists when the reference is replaced by a zero vector, mismatched, or absent at test time. The result is a single model that reaches 2.57% EER and 0.074 minDCF on the ASVspoof 5 benchmark, surpassing even large ensemble systems.

Core claim

Conditioning a spoofing countermeasure on speaker-reference recordings during training causes the optimization to rapidly diminish the reference contributions, yielding a detector whose inference is largely independent of the reference channel yet still benefits from the invariance induced by reference-augmented training.

What carries the argument

Reference-Augmented Training (RAT), the strategy of supplying speaker-reference recordings as an additional input channel only during training.

If this is right

  • A single detector trained under RAT surpasses large ensemble systems on the ASVspoof 5 benchmark.
  • Detection accuracy remains high when the reference input is replaced by a zero vector at inference.
  • The optimization process quickly reduces dependence on the reference channel.
  • The same invariance benefit appears even when the reference recording is mismatched or absent during testing.
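
The zero-vector claim in the list above can be probed by comparing a detector's score with the real reference against its score with an all-zero reference. The sketch below is a toy illustration only: the linear `score` function, the scalar `gate` on the reference branch, and all weights are hypothetical stand-ins and are not the paper's RIB or classifier.

```python
def score(test_emb, ref_emb, w_test, w_ref, gate):
    """Toy linear detector: a test branch plus a gated reference branch.

    Hypothetical model for illustration; not the architecture from the paper.
    """
    test_term = sum(w * x for w, x in zip(w_test, test_emb))
    ref_term = sum(w * x for w, x in zip(w_ref, ref_emb))
    return test_term + gate * ref_term


def reference_sensitivity(pairs, w_test, w_ref, gate):
    """Mean absolute score change when each reference is replaced by zeros."""
    total = 0.0
    for test_emb, ref_emb in pairs:
        zero_ref = [0.0] * len(ref_emb)
        with_ref = score(test_emb, ref_emb, w_test, w_ref, gate)
        without_ref = score(test_emb, zero_ref, w_test, w_ref, gate)
        total += abs(with_ref - without_ref)
    return total / len(pairs)


# Synthetic (test embedding, reference embedding) pairs, made up for the demo.
pairs = [([1.0, 2.0], [1.0, 1.0]), ([0.5, -1.0], [2.0, 0.0])]
w_test, w_ref = [1.0, 1.0], [1.0, -0.5]

# A gate near zero mimics a model that has learned to ignore the reference.
for gate in (1.0, 0.1, 0.0):
    print(gate, reference_sensitivity(pairs, w_test, w_ref, gate))
```

A sensitivity close to zero on held-out pairs would be consistent with inference that is largely independent of the reference channel; it says nothing by itself about why training with the reference helps.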

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Reference signals supplied only at training time may act as a form of regularization that encourages robustness to spoofing artifacts.
  • The approach could be tested on other audio classification tasks where auxiliary conditioning during training improves generalization without added inference cost.
  • If the invariance mechanism generalizes, similar reference-augmented schedules might reduce the need for matched reference data at deployment in related verification systems.

Load-bearing premise

The observed gains arise because reference-augmented training induces useful invariance rather than from unrelated differences in model size, data augmentation, or optimization settings.

What would settle it

An ablation that trains an otherwise identical model without any reference channel and obtains the same 2.57 percent EER and 0.074 minDCF on ASVspoof 5.

Figures

Figures reproduced from arXiv: 2606.10908 by Anton Firc, Jakub Reš, Kamil Malinka, Vojtěch Staněk.

Figure 1
Figure 1: Proposed RAT architecture with the Reference-Informed Block (RIB), mean pooling, and a downstream MLP classifier.
Adjacent text excerpt: … parameters as our base feature extractor, shared for both the reference and test recordings. For each time frame, it extracts a 1024-dimensional feature vector. We utilize features from all 24 transformer layers of XLS-R, i.e., $H_{\text{ref}} = \text{XLS-R}(x^{s}_{\text{ref}}) \in \mathbb{R}^{24 \times t_{\text{ref}} \times 1024}$ and $H_{\text{test}} = \text{XLS-R}(x^{s}_{\text{test}}) \in \mathbb{R}^{24 \times t \ldots}$ …
Figure 2
Figure 2: Training dynamics under zero and noise reference substitutions. Left: normalized delta margin ($\Delta m_n$) and the attention-to-MLP branch ratio ($r_b$). Right: normalized activation deltas ($\Delta h_n$) for the RIB block ($Z_{\text{RIB}}$) and pooled embeddings ($e$).
Adjacent text excerpt: 5.1. Functional Dependence on Reference. First, we quantify the functional impact of the reference input on the model's output. We calculate a margin: $m(x^{i}_{\text{ref}}, x^{i}_{\text{test}})$ …
Original abstract

We introduce a spoofing countermeasure architecture conditioned on speaker-reference recordings, but observe that it converges to a solution that effectively ignores the reference during inference. Surprisingly, training with a reference channel induces invariance that improves deepfake detection, even when the reference is absent or mismatched during inference. Based on this observation, we propose a Reference-Augmented Training (RAT) strategy. RAT yields improved detection performance compared to single-utterance baselines, even when the reference recording is replaced with a zero vector at inference. Through rigorous analysis, we demonstrate that the optimization process rapidly diminishes the reference contributions, leading to inference largely independent of the reference channel. Using RAT, we achieve state-of-the-art 2.57% EER and 0.074 minDCF on the ASVspoof 5 benchmark with a single detector, surpassing even large ensemble systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Reference-Augmented Training (RAT) for ASV anti-spoofing. A detector is conditioned on speaker-reference recordings during training, yet optimization rapidly diminishes the reference contribution so that inference is effectively reference-independent. The central claim is that this training procedure induces invariance to the reference channel, yielding improved deepfake detection even when the reference is replaced by a zero vector or mismatched at inference. RAT is reported to outperform single-utterance baselines and to reach state-of-the-art 2.57% EER and 0.074 minDCF on the ASVspoof 5 benchmark with a single detector, surpassing large ensembles.

Significance. If the performance gains are causally attributable to reference-augmented training rather than confounding differences in architecture or optimization, the method supplies a training-only modification that improves countermeasures without altering inference cost or architecture. The reported analysis of reference-contribution decay, if isolated from other factors, would constitute a useful empirical observation about how auxiliary conditioning can regularize spoofing detectors.

major comments (2)
  1. [Abstract] Abstract: the claim that RAT induces invariance responsible for the gains requires an ablation that holds model architecture, loss, data pipeline, and optimizer schedule fixed while toggling only the presence of the reference channel during training. The abstract states that optimization diminishes reference contributions but does not report such an isolation experiment, leaving the attribution to RAT unproven.
  2. [Results] Results (benchmark numbers): the reported 2.57% EER and 0.074 minDCF are presented without error bars, number of runs, or statistical tests against the single-utterance baseline, which is necessary to substantiate the claim of surpassing ensemble systems on ASVspoof 5.
minor comments (2)
  1. [Abstract] Abstract: the term 'rigorous analysis' of reference-contribution decay is used without indicating the quantitative method (e.g., gradient norms, ablation on reference masking, or contribution metrics) employed.
  2. [Methods] Notation: the precise mechanism by which the reference recording is fused into the detector (concatenation, cross-attention, etc.) should be stated with an equation in the methods section.
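
For orientation on the second minor comment, one generic form such an equation could take is standard scaled dot-product cross-attention with the test recording as queries and the reference as keys and values. This is a hypothetical sketch, not the paper's RIB definition: the symbols $\tilde{H}_{\text{test}} \in \mathbb{R}^{t_{\text{test}} \times d}$ and $\tilde{H}_{\text{ref}} \in \mathbb{R}^{t_{\text{ref}} \times d}$ (frame features after some layer aggregation), the projections $W_Q, W_K \in \mathbb{R}^{d \times d_k}$ and $W_V \in \mathbb{R}^{d \times d}$, the single head, and the residual connection are all assumptions.

```latex
% Hypothetical single-head cross-attention fusion; not taken from the paper.
\begin{aligned}
Q &= \tilde{H}_{\text{test}} W_Q, \qquad
K = \tilde{H}_{\text{ref}} W_K, \qquad
V = \tilde{H}_{\text{ref}} W_V, \\
Z &= \tilde{H}_{\text{test}}
   + \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V .
\end{aligned}
```

Under this bias-free form, an all-zero reference gives $V = 0$ and therefore $Z = \tilde{H}_{\text{test}}$, so the zero-vector substitution reduces the block to its test-only path. Whether the paper's block behaves this way depends on details (biases, normalization, gating) that only the full methods section can settle.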

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment below and describe the revisions we will implement to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that RAT induces invariance responsible for the gains requires an ablation that holds model architecture, loss, data pipeline, and optimizer schedule fixed while toggling only the presence of the reference channel during training. The abstract states that optimization diminishes reference contributions but does not report such an isolation experiment, leaving the attribution to RAT unproven.

    Authors: We agree that an explicit isolation experiment would provide stronger causal evidence for the role of reference-augmented training. While the manuscript already includes analysis showing rapid decay of reference contributions during optimization, this does not fully substitute for a controlled ablation that differs solely in the presence of the reference channel. We will add this ablation (training identical models with and without the reference input, all other factors fixed) and report the resulting performance differences in the revised manuscript. revision: yes

  2. Referee: [Results] Results (benchmark numbers): the reported 2.57% EER and 0.074 minDCF are presented without error bars, number of runs, or statistical tests against the single-utterance baseline, which is necessary to substantiate the claim of surpassing ensemble systems on ASVspoof 5.

    Authors: We acknowledge the importance of statistical rigor when claiming improvements over baselines and ensembles. The current results reflect single-run evaluations. In the revision we will repeat all experiments across multiple random seeds, report means and standard deviations, and include statistical significance tests comparing RAT to the single-utterance baseline. revision: yes
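
The multi-seed reporting promised in the responses above could be organized along the following lines. This is a minimal sketch under stated assumptions: the per-seed EER lists are synthetic placeholders (not results from the paper), the function names are hypothetical, and the exact paired sign-flip permutation test is only one of several reasonable choices of significance test.

```python
from itertools import product
from statistics import mean, stdev


def summarize(values):
    """Return (mean, sample standard deviation) across seeds."""
    return mean(values), stdev(values)


def paired_sign_flip_p_value(a, b):
    """Exact two-sided sign-flip permutation test on paired differences.

    Enumerates all 2^n sign assignments, so it is only practical for a
    small number of seeds.
    """
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(mean(diffs))
    hits = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        flipped = [s * d for s, d in zip(signs, diffs)]
        total += 1
        # Small tolerance guards against floating-point ties.
        if abs(mean(flipped)) >= observed - 1e-12:
            hits += 1
    return hits / total


# SYNTHETIC placeholder EERs (%), one per seed, same seeds for both systems.
baseline_eer_by_seed = [3.0, 3.1, 2.9, 3.2, 3.0]
rat_eer_by_seed = [2.5, 2.6, 2.4, 2.7, 2.5]

print("baseline mean/std:", summarize(baseline_eer_by_seed))
print("RAT mean/std:", summarize(rat_eer_by_seed))
print("p-value:", paired_sign_flip_p_value(baseline_eer_by_seed, rat_eer_by_seed))
```

With only five seeds the smallest achievable two-sided p-value of this test is 2/32, so a claim of significance at the usual 0.05 level would need more seeds or a test over per-utterance scores instead.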

Circularity Check

0 steps flagged

No circularity; empirical benchmark result stands on its own

Full rationale

The paper introduces RAT as a training strategy and reports concrete EER/minDCF numbers on the fixed ASVspoof 5 benchmark. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text that would make the reported performance reduce to the inputs by construction. The central claim is an observed empirical improvement, not a self-referential derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Only the abstract is available, so the full set of modeling choices, loss functions, and data preprocessing steps cannot be audited. The claim rests on the validity of the ASVspoof 5 benchmark and standard deep-learning optimization assumptions.

axioms (1)
  • domain assumption: The ASVspoof 5 benchmark and its official evaluation protocol (EER, minDCF) constitute a fair and representative test of anti-spoofing performance.
    All reported state-of-the-art claims are measured against this benchmark.
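
Since every headline number depends on these two metrics, a minimal sketch of how EER and a normalized minimum detection cost can be computed from raw scores is given here. It is not the official ASVspoof scoring code: the threshold sweep is a simple approximation, higher scores are assumed to mean "more bona fide", and the cost parameters passed in the demo (`c_miss`, `c_fa`, `p_spoof`) are placeholders that should be taken from the official evaluation plan.

```python
def error_rates(bonafide_scores, spoof_scores, threshold):
    """Miss and false-alarm rates at one threshold (accept if score >= threshold)."""
    p_miss = sum(s < threshold for s in bonafide_scores) / len(bonafide_scores)
    p_fa = sum(s >= threshold for s in spoof_scores) / len(spoof_scores)
    return p_miss, p_fa


def candidate_thresholds(bonafide_scores, spoof_scores):
    """Every observed score, plus one value above all of them (reject everything)."""
    scores = sorted(set(bonafide_scores) | set(spoof_scores))
    return scores + [scores[-1] + 1.0]


def compute_eer(bonafide_scores, spoof_scores):
    """Approximate EER: average of the two rates where they are closest."""
    best_gap, best_eer = None, None
    for t in candidate_thresholds(bonafide_scores, spoof_scores):
        p_miss, p_fa = error_rates(bonafide_scores, spoof_scores, t)
        gap = abs(p_miss - p_fa)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (p_miss + p_fa) / 2
    return best_eer


def compute_min_dcf(bonafide_scores, spoof_scores, c_miss, c_fa, p_spoof):
    """Minimum normalized detection cost over all candidate thresholds."""
    # Normalizer: cost of the better trivial system (accept all or reject all).
    norm = min(c_miss * (1 - p_spoof), c_fa * p_spoof)
    costs = []
    for t in candidate_thresholds(bonafide_scores, spoof_scores):
        p_miss, p_fa = error_rates(bonafide_scores, spoof_scores, t)
        cost = c_miss * (1 - p_spoof) * p_miss + c_fa * p_spoof * p_fa
        costs.append(cost / norm)
    return min(costs)


# Synthetic scores, made up for the demo.
bonafide = [0.9, 0.8, 0.4, 0.7]
spoof = [0.1, 0.2, 0.5, 0.3]
print("EER:", compute_eer(bonafide, spoof))
# Placeholder cost parameters; not asserted to be the benchmark's settings.
print("minDCF:", compute_min_dcf(bonafide, spoof, c_miss=1.0, c_fa=10.0, p_spoof=0.05))
```

The two metrics can rank systems differently because minDCF weights misses and false alarms by the chosen costs and prior, while EER fixes the operating point where both error rates coincide.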

pith-pipeline@v0.9.1-grok · 5693 in / 1482 out tokens · 35180 ms · 2026-06-27T11:40:42.659624+00:00 · methodology

Reference graph

Works this paper leans on

39 extracted references · 1 linked inside Pith

  1. [1]

    Introduction. Automatic Speaker Verification (ASV) systems face increasing threats from sophisticated spoofing attacks [1], including deepfakes created by text-to-speech synthesis and voice conversion [2, 3]. Traditional spoofing countermeasures (CMs) operate on single test utterances [4] without leveraging additional speaker-specific information tha...

  2. [2]

    Background. 2.1. ASV spoofing countermeasures. Current anti-spoofing systems utilize pretrained self-supervised learning (SSL) models such as Wav2Vec2 [10] or WavLM [11] due to their ability to extract rich speaker representations. These rich features are further processed and pooled, most commonly by Graph Attention Networks [12] from the AASIST framewor...

  3. [3]

    Architecture & Methodology. Our architecture consists of three main components: an SSL feature extractor, a reference-informed block, and a downstream classifier, presented in Figure 1. SSL Feature Extractor: We use a pre-trained XLS-R [24] model based on the Wav2Vec2 architecture with 300M param- … Reference-Informed Bl...

  4. [4]

    Results. We first evaluate the detection performance of the proposed Reference-Augmented Training (RAT) strategy against single-utterance baselines. We compare our method (≈328M parameters) with the best single system reported in the available literature: WavLM + Hybrid Pruning (≈86M) [28], as well as best models from the ASVspoof 5 challenge: best sin...

  5. [5]

    To measure the reference reliance, we replace the reference $x^{i}_{\text{ref}}$ with a perturbed reference $\tilde{r}_i$ during evaluation after each epoch

    Analysis of Training Dynamics. To understand how the model converges to this reference-invariant solution, we analyzed the model's and the RIB block's internal dynamics during training. To measure the reference reliance, we replace the reference $x^{i}_{\text{ref}}$ with a perturbed reference $\tilde{r}_i$ during evaluation after each epoch. We explored two options of $\tilde{r}_i$, that...

  6. [6]

    Conclusion. We proposed Reference-Augmented Training (RAT), a training paradigm for ASV anti-spoofing that conditions the model on reference speaker recordings. By employing a Reference-Informed Block (RIB) with cross-attention, we discovered that the network utilizes the reference during early training as a corrective signal to better isolate spoofing ...

  7. [7]

    Computational resources were provided by the e-INFRA CZ project (ID:90254), supported by the Ministry of Education, Youth and Sports of the Czech Republic

    Acknowledgments. This work was supported by the Brno University of Technology internal project FIT-S-26-9011. Computational resources were provided by the e-INFRA CZ project (ID:90254), supported by the Ministry of Education, Youth and Sports of the Czech Republic

  8. [8]

    The authors reviewed and edited the output as needed and take full responsibility for the publication's content

    Generative AI Use Disclosure. During the preparation of this work, the authors used Generative AI Models (specifically Google Gemini, ChatGPT, and Grammarly) for language editing and text refinement. The authors reviewed and edited the output as needed and take full responsibility for the publication's content

  9. [9]

    Diffuse or Confuse: A Diffusion Deepfake Speech Dataset,

    A. Firc, K. Malinka, and P. Hanáček, “Diffuse or Confuse: A Diffusion Deepfake Speech Dataset,” in 2024 International Conference of the Biometrics Special Interest Group (BIOSIG). IEEE, Sep. 2024, pp. 1–7

  10. [10]

    The dawn of a text-dependent society: deepfakes as a threat to speech verification systems,

    A. Firc and K. Malinka, “The dawn of a text-dependent society: deepfakes as a threat to speech verification systems,” in Proceedings of the 37th ACM/SIGAPP Symposium on Applied Computing, ser. SAC '22. New York, NY, USA: Association for Computing Machinery, 2022, pp. 1646–1655

  11. [11]

    Resilience of Voice Assistants to Synthetic Speech,

    K. Malinka et al., “Resilience of Voice Assistants to Synthetic Speech,” in Computer Security – ESORICS 2024. Cham: Springer Nature Switzerland, 2024, pp. 66–84

  12. [12]

    Evaluation framework for deepfake speech detection: a comparative study of state-of-the-art deepfake speech detectors,

    A. Firc, K. Malinka, and P. Hanáček, “Evaluation framework for deepfake speech detection: a comparative study of state-of-the-art deepfake speech detectors,” Cybersecurity, vol. 8, no. 1, 2025

  13. [13]

    Tandem Assessment of Spoofing Countermeasures and Automatic Speaker Verification: Fundamentals,

    T. Kinnunen, H. Delgado, N. Evans, K. A. Lee, V. Vestman, A. Nautsch, M. Todisco, X. Wang, M. Sahidullah, J. Yamagishi, and D. A. Reynolds, “Tandem Assessment of Spoofing Countermeasures and Automatic Speaker Verification: Fundamentals,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2195–2210, 2020

  14. [14]

    Deepfake Speech Detection: A Spectrogram Analysis,

    A. Firc, K. Malinka, and P. Hanáček, “Deepfake Speech Detection: A Spectrogram Analysis,” in Proceedings of the 39th ACM/SIGAPP Symposium on Applied Computing, ser. SAC '24. New York, NY, USA: Association for Computing Machinery, 2024, pp. 1312–1320

  15. [15]

    SASV 2022: The First Spoofing-Aware Speaker Verification Challenge,

    J.-w. Jung, H. Tak, H.-j. Shim, H.-S. Heo, B.-J. Lee, S.-W. Chung, H.-J. Yu, N. Evans, and T. Kinnunen, “SASV 2022: The First Spoofing-Aware Speaker Verification Challenge,” in Proc. Interspeech (submitted), 2022

  16. [16]

    Differential Anomaly Detection for Facial Images,

    M. Ibsen, L. J. Gonzalez-Soler, C. Rathgeb, P. Drozdowski, M. Gomez-Barrero, and C. Busch, “Differential Anomaly Detection for Facial Images,” in 2021 IEEE International Workshop on Information Forensics and Security (WIFS). IEEE, 2021, pp. 1–6

  17. [17]

    Speaker-Aware Anti-spoofing,

    X. Liu, M. Sahidullah, K. A. Lee, and T. Kinnunen, “Speaker-Aware Anti-spoofing,” in Interspeech 2023, 2023, pp. 2498–2502

  18. [18]

    SZU-AFS anti-spoofing system for the ASVspoof 5 Challenge,

    Y. Xu, J. Zhong, S. Zheng, Z. Liu, and B. Li, “SZU-AFS anti-spoofing system for the ASVspoof 5 Challenge,” in The Automatic Speaker Verification Spoofing Countermeasures Workshop, 2024, pp. 64–71

  19. [19]

    Exploring WavLM back-ends for speech spoofing and deepfake detection,

    T. Stourbe, V. Miara, T. Lepage, and R. Dehak, “Exploring WavLM back-ends for speech spoofing and deepfake detection,” in The Automatic Speaker Verification Spoofing Countermeasures Workshop, 2024, pp. 72–78

  20. [20]

    Graph Attention Networks for Anti-Spoofing,

    H. Tak, J. weon Jung, J. Patino, M. Todisco, and N. Evans, “Graph Attention Networks for Anti-Spoofing,” in Proc. Interspeech 2021, 2021, pp. 2356–2360

  21. [21]

    AASIST: Audio Anti-Spoofing Using Integrated Spectro-Temporal Graph Attention Networks,

    J.-w. Jung, H.-S. Heo, H. Tak, H.-j. Shim, J. S. Chung, B.-J. Lee, H.-J. Yu, and N. Evans, “AASIST: Audio Anti-Spoofing Using Integrated Spectro-Temporal Graph Attention Networks,” in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 6367–6371

  22. [22]

    A single end-to-end voice anti-spoofing model with graph attention and feature aggregation for ASVspoof 5 Challenge,

    W. Xia et al., “A single end-to-end voice anti-spoofing model with graph attention and feature aggregation for ASVspoof 5 Challenge,” in The Automatic Speaker Verification Spoofing Countermeasures Workshop, 2024

  23. [23]

    SCDF: A Speaker Characteristics Deepfake Speech Dataset for Bias Analysis,

    V. Staněk, K. Srna, A. Firc, and K. Malinka, “SCDF: A Speaker Characteristics Deepfake Speech Dataset for Bias Analysis,” in BIOSIG 2025. Gesellschaft für Informatik e.V., 2025

  24. [24]

    BUT systems and analyses for the ASVspoof 5 Challenge,

    J. Rohdin, L. Zhang, P. Oldřich, V. Staněk, D. Mihola, J. Peng, T. Stafylakis, D. Beveraki, A. Silnova, J. Brukner, and L. Burget, “BUT systems and analyses for the ASVspoof 5 Challenge,” in The Automatic Speaker Verification Spoofing Countermeasures Workshop (ASVspoof 2024), 2024, pp. 24–31

  25. [25]

    Exploring generalization to unseen audio data for spoofing: insights from SSL models,

    A. Kulkarni, H. M. Tran, A. Kulkarni, S. Dowerah, D. Lolive, and M. M. Doss, “Exploring generalization to unseen audio data for spoofing: insights from SSL models,” in The Automatic Speaker Verification Spoofing Countermeasures Workshop (ASVspoof 2024), 2024, pp. 86–93

  26. [26]

    ASVspoof 5 Challenge: advanced ResNet architectures for robust voice spoofing detection,

    A.-T. Dao, M. Rouvier, and D. Matrouf, “ASVspoof 5 Challenge: advanced ResNet architectures for robust voice spoofing detection,” in The Automatic Speaker Verification Spoofing Countermeasures Workshop, 2024, pp. 163–169

  27. [27]

    Enhancing spoofing detection in ASVspoof 5 Workshop 2024: fusion of WavLM-ResNet18-SA for optimal performance against speech deepfakes,

    P.-C. Chan, W.-Y. Chen, and J.-C. Wang, “Enhancing spoofing detection in ASVspoof 5 Workshop 2024: fusion of WavLM-ResNet18-SA for optimal performance against speech deepfakes,” in The Automatic Speaker Verification Spoofing Countermeasures Workshop, 2024, pp. 158–162

  28. [28]

    Evolutionary Multi-Objective Fusion of Deepfake Speech Detectors,

    V. Staněk, M. Perešíni, L. Sekanina, A. Firc, and K. Malinka, “Evolutionary Multi-Objective Fusion of Deepfake Speech Detectors,” 2026. [Online]. Available: https://arxiv.org/abs/2604.01330

  29. [29]

    Deepfakes as a threat to a speaker and facial recognition: An overview of tools and attack vectors,

    A. Firc, K. Malinka, and P. Hanáček, “Deepfakes as a threat to a speaker and facial recognition: An overview of tools and attack vectors,” Heliyon, vol. 9, no. 4, p. e15090, 2023

  30. [30]

    ASVspoof 5: crowdsourced speech data, deepfakes, and adversarial attacks at scale,

    X. Wang, H. Delgado, H. Tak, J. weon Jung, H. jin Shim, M. Todisco, I. Kukanov, X. Liu, M. Sahidullah, T. H. Kinnunen, N. Evans, K. A. Lee, and J. Yamagishi, “ASVspoof 5: crowdsourced speech data, deepfakes, and adversarial attacks at scale,” in The Automatic Speaker Verification Spoofing Countermeasures Workshop (ASVspoof 2024), 2024, pp. 1–8

  31. [31]

    Revisiting and Improving Scoring Fusion for Spoofing-aware Speaker Verification Using Compositional Data Analysis,

    X. Wang, T. Kinnunen, K. A. Lee, P.-G. Noé, and J. Yamagishi, “Revisiting and Improving Scoring Fusion for Spoofing-aware Speaker Verification Using Compositional Data Analysis,” in Interspeech 2024, 2024, pp. 1110–1114

  32. [32]

    XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale,

    A. Babu, C. Wang, A. Tjandra, K. Lakhotia, Q. Xu, N. Goyal, K. Singh, P. von Platen, Y. Saraf, J. Pino, A. Baevski, A. Conneau, and M. Auli, “XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale,” in Interspeech 2022, 2022, pp. 2278–2282

  33. [33]

    Attention is all you need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., vol. 30. Curran Associates, Inc., 2017

  34. [34]

    Malafide: a novel adversarial convolutive noise attack against deepfake and spoofing detection systems,

    M. Panariello, W. Ge, H. Tak, M. Todisco, and N. Evans, “Malafide: a novel adversarial convolutive noise attack against deepfake and spoofing detection systems,” in Interspeech 2023, 2023, pp. 2868–2872

  35. [35]

    Malacopula: adversarial automatic speaker verification attacks using a neural-based generalised Hammerstein model,

    M. Todisco, M. Panariello, X. Wang, H. Delgado, K. A. Lee, and N. Evans, “Malacopula: adversarial automatic speaker verification attacks using a neural-based generalised Hammerstein model,” in The Automatic Speaker Verification Spoofing Countermeasures Workshop (ASVspoof 2024), 2024, pp. 94–100

  36. [36]

    Hybrid pruning: In-situ compression of self-supervised speech models for speaker verification and anti-spoofing,

    J. Peng, L. Zhang, J. Han, O. Plchot, J. Rohdin, T. Stafylakis, S. Wang, and J. Černocký, “Hybrid pruning: In-situ compression of self-supervised speech models for speaker verification and anti-spoofing,” 2025. [Online]. Available: https://arxiv.org/abs/2508.16232

  37. [37]

    Learn from real: reality defender’s submission to ASVspoof5 Challenge,

    Y. Zhu, C. Goel, S. Koppisetti, T. Tran, A. Kumar, and G. Bharaj, “Learn from real: reality defender's submission to ASVspoof5 Challenge,” in The Automatic Speaker Verification Spoofing Countermeasures Workshop (ASVspoof 2024), 2024, pp. 116–123

  38. [38]

    Rawboost: A Raw Data Boosting and Augmentation Method Applied to Automatic Speaker Verification Anti-Spoofing,

    H. Tak, M. Kamble, J. Patino, M. Todisco, and N. Evans, “Rawboost: A Raw Data Boosting and Augmentation Method Applied to Automatic Speaker Verification Anti-Spoofing,” in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 6382–6386

  39. [39]

    USTC-KXDIGIT system description for ASVspoof5 Challenge,

    Y. Chen, H. Wu, N. Jiang, X. Xia, Q. Gu, Y. Hao, P. Cai, Y. Guan, J. Wang, W.-L. Xie, L. Fang, S. Fang, Y. Song, W. Guo, L. Liu, and M. Xu, “USTC-KXDIGIT system description for ASVspoof5 Challenge,” in The Automatic Speaker Verification Spoofing Countermeasures Workshop (ASVspoof 2024), 2024, pp. 109–115